Cassandra 2.1 Used by DataStax Enterprise 4.7
Documentation
July 10, 2015
Contents
CQL
Installing
Installing on RHEL-based systems
Installing on Debian-based systems
Installing the binary tarball
Installing on Windows systems
Installing prior releases
Uninstalling Cassandra
Installing on cloud providers
Initializing a cluster
Initializing a multiple node cluster (single data center)
Initializing a multiple node cluster (multiple data centers)
Security
Securing Cassandra
SSL encryption
Client-to-node encryption
Node-to-node encryption
Using cqlsh with SSL encryption
Preparing server certificates
Internal authentication
Internal authentication
Configuring authentication
Logging in using cqlsh
Internal authorization
Object permissions
Configuring internal authorization
Configuring firewall port access
Enabling JMX authentication
Database internals
Storage engine
Separate table directories
Cassandra storage basics
The write path to compaction
How Cassandra stores indexes
About index updates
The write path of an update
About deletes
About hinted handoff writes
Reads
About reads
How off-heap components affect reads
Reading from a partition
How write patterns affect reads
How the row cache affects reads
About transactions and concurrency control
Atomicity
Consistency
Isolation
Durability
Lightweight transactions
Data consistency
About data consistency
About built-in consistency repair features
Configuration
cassandra.yaml configuration file
Configuring gossip settings
Configuring the heap dump directory
Configuring virtual nodes
Enabling virtual nodes on a new cluster
Enabling virtual nodes on an existing production cluster
Using multiple network interfaces
Configuring logging
Commit log archive configuration
Generating tokens
Hadoop support
Operations
Monitoring Cassandra
Monitoring a Cassandra cluster
Tuning Bloom filters
Data caching
Configuring data caches
Monitoring and adjusting caching
Configuring memtable throughput
Configuring compaction
Compression
When to compress data
Configuring compression
Testing compaction and compression
Tuning Java resources
Purging gossip state on a node
Repairing nodes
Adding or removing nodes, data centers, or clusters
Adding nodes to an existing cluster
Adding a data center to a cluster
Replacing a dead node or dead seed node
Replacing a running node
Moving a node from one rack to another
Decommissioning a data center
Removing a node
Switching snitches
Edge cases for transitioning or migrating a cluster
Adding or replacing single-token nodes
setcachecapacity
setcachekeystosave
setcompactionthreshold
setcompactionthroughput
sethintedhandoffthrottlekb
setlogginglevel
setstreamthroughput
settraceprobability
snapshot
status
statusbackup
statusbinary
statusgossip
statushandoff
statusthrift
stop
stopdaemon
tpstats
truncatehints
upgradesstables
version
Cassandra bulk loader (sstableloader)
The cassandra utility
The cassandra-stress tool
Using the Daemon Mode
Interpreting the output of cassandra-stress
The sstablescrub utility
The sstablesplit utility
The sstablekeys utility
The sstableupgrade tool
References
Starting and stopping Cassandra
Starting Cassandra as a service
Starting Cassandra as a stand-alone process
Stopping Cassandra as a service
Stopping Cassandra as a stand-alone process
Clearing the data as a service
Clearing the data as a stand-alone process
Install locations
Tarball installation directories
Package installation directories
Cassandra include file
Cassandra-CLI utility (deprecated)
Table attributes
Troubleshooting
Peculiar Linux kernel performance problem on NUMA systems
Reads are getting slower while writes are still fast
Nodes seem to freeze after some period of time
Nodes are dying with OOM errors
Release notes
About Apache Cassandra
Cassandra is a partitioned row store database
Cassandra's architecture allows any authorized user to connect to any node in any data center and access data using the CQL language. For ease of use, CQL uses a similar syntax to SQL. The most basic way to interact with Cassandra is using the CQL shell, cqlsh. Using cqlsh, you can create keyspaces and tables, insert and query tables, plus much more. If you prefer a graphical tool, you can use DataStax DevCenter. For production, DataStax supplies a number of drivers so that CQL statements can be passed from client to cluster and back. Other administrative tasks can be accomplished using OpsCenter.

Automatic data distribution
Cassandra provides automatic data distribution across all nodes that participate in a ring or database cluster. There is nothing programmatic that a developer or administrator needs to do or code to distribute data across a cluster because data is transparently partitioned across all nodes in a cluster.

Built-in and customizable replication
Cassandra also provides built-in and customizable replication, which stores redundant copies of data across nodes that participate in a Cassandra ring. This means that if any node in a cluster goes down, one or more copies of that node's data is available on other machines in the cluster. Replication can be configured to work across one data center, many data centers, and multiple cloud availability zones.

Cassandra supplies linear scalability
Cassandra supplies linear scalability, meaning that capacity may be easily added simply by adding new nodes online. For example, if 2 nodes can handle 100,000 transactions per second, 4 nodes will support 200,000 transactions/sec and 8 nodes will tackle 400,000 transactions/sec.
CQL
Cassandra Query Language (CQL) is the default and primary interface into the Cassandra DBMS. Using
CQL is similar to using SQL (Structured Query Language). CQL and SQL share the same abstract idea
of a table constructed of columns and rows. The main difference from SQL is that Cassandra does not
support joins or subqueries. Instead, Cassandra emphasizes denormalization through CQL features like
collections and clustering specified at the schema level.
CQL is the recommended way to interact with Cassandra. Performance and the simplicity of reading and
using CQL is an advantage of modern Cassandra over older Cassandra APIs.
The CQL documentation contains a data modeling section, examples, and command reference. The cqlsh
utility for using CQL interactively on the command line is also covered.
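For example, the following illustrative cqlsh statements create a keyspace and a table that uses a set collection in place of a joined table; the keyspace, table, and column names are hypothetical, not part of the reference samples.

CREATE KEYSPACE demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE demo.users (
  username text PRIMARY KEY,
  email text,
  tags set<text>    -- a collection column used instead of a join
);

INSERT INTO demo.users (username, email, tags)
  VALUES ('jsmith', '[email protected]', {'admin', 'beta'});

SELECT * FROM demo.users WHERE username = 'jsmith';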
Understanding the architecture
Architecture in brief
Essential information for understanding and using Cassandra.
Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure.
Its architecture is based on the understanding that system and hardware failures can and do occur.
Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across
homogeneous nodes where data is distributed among all nodes in the cluster. Each node exchanges
information across the cluster every second. A sequentially written commit log on each node captures
write activity to ensure data durability. Data is then indexed and written to an in-memory structure, called
a memtable, which resembles a write-back cache. Once the memory structure is full, the data is written to
disk in an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster.
Using a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete
data and tombstones (an indicator that data was deleted).
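To observe this write path, you can force a memtable flush and then list the resulting files; the keyspace and table names are hypothetical and the paths assume a default package installation.

$ nodetool flush demo users                 # flush the memtable for one table to an SSTable
$ ls /var/lib/cassandra/data/demo/users-*/  # flushed SSTable data files
$ ls /var/lib/cassandra/commitlog/          # sequentially written commit log segments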
Cassandra is a row-oriented database. Cassandra's architecture allows any authorized user to connect
to any node in any data center and access data using the CQL language. For ease of use, CQL uses
a similar syntax to SQL. From the CQL perspective the database consists of tables. Typically, a cluster
has one keyspace per application. Developers can access CQL through cqlsh as well as via drivers for
application languages.
Client read or write requests can be sent to any node in the cluster. When a client connects to a node with
a request, that node serves as the coordinator for that particular client operation. The coordinator acts as
a proxy between the client application and the nodes that own the data being requested. The coordinator
determines which nodes in the ring should get the request based on how the cluster is configured. For
more information, see Client requests.
Key structures
• Node
Where you store your data. It is the basic infrastructure component of Cassandra.
• Data center
A collection of related nodes. A data center can be a physical data center or virtual data center.
Different workloads should use separate data centers, either physical or virtual. Replication is set by
data center. Using separate data centers prevents Cassandra transactions from being impacted by
other workloads and keeps requests close to each other for lower latency. Depending on the replication
factor, data can be written to multiple data centers. However, data centers should never span physical
locations.
• Cluster
A cluster contains one or more data centers. It can span physical locations.
• Commit log
All data is written first to the commit log for durability. After all its data has been flushed to SSTables, it
can be archived, deleted, or recycled.
• Table
A collection of ordered columns fetched by row. A row consists of columns and has a primary key. The
first part of the key is a column name.
• SSTable
A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables
periodically. SSTables are append only and stored on disk sequentially, and are maintained for each
Cassandra table.
Failure detection and recovery
Node failures can result from various causes such as hardware failures and network outages. Node
outages are often transient but can last for extended periods. Because a node outage rarely signifies a
permanent departure from the cluster it does not automatically result in permanent removal of the node
from the ring. Other nodes will periodically try to re-establish contact with failed nodes to see if they are
back up. To permanently change a node's membership in a cluster, administrators must explicitly add or
remove nodes from a Cassandra cluster using the nodetool utility or OpsCenter.
When a node comes back online after an outage, it may have missed writes for the replica data
it maintains. Once the failure detector marks a node as down, missed writes are stored by other
replicas for a period of time providing hinted handoff is enabled. If a node is down for longer than
max_hint_window_in_ms (3 hours by default), hints are no longer saved. Nodes that die may have stored
undelivered hints. Run a repair after recovering a node that has been down for an extended period.
Moreover, you should routinely run nodetool repair on all nodes to ensure they have consistent data.
For more explanation about hint storage, see Modern hinted handoff.
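For example, after a node has been down longer than the hint window, you might run a repair on it once it is back online; the keyspace argument is optional and shown only for illustration.

$ nodetool repair                # repair all keyspaces on this node
$ nodetool repair my_keyspace    # or limit the repair to a single keyspace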
Consistent hashing
Consistent hashing allows distribution of data across a cluster to minimize reorganization when nodes are
added or removed.
Consistent hashing partitions data based on the partition key. (For an explanation of
partition keys and primary keys, see the Data modeling example in CQL for Cassandra 2.0.)
For example, consider a table whose rows are identified by partition keys; the partitioner hashes each partition key to a token.
Each node in the cluster is responsible for a range of data based on the hash value:
Hash values in a 4 node cluster
Each of the four nodes (A, B, C, and D) is responsible for one of the following hash ranges:
• -9223372036854775808 to -4611686018427387903
• -4611686018427387904 to -1
• -1 to 4611686018427387903
• 4611686018427387904 to 9223372036854775807
Cassandra places the data on each node according to the value of the partition key and the range that
the node is responsible for. For example, in a four node cluster, each row in this example is stored on the node whose range contains the token of the row's partition key.
Virtual nodes
Overview of virtual nodes (vnodes).
Vnodes simplify many tasks in Cassandra:
• You no longer have to calculate and assign tokens to each node.
• Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the
cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a
node fails, the load is spread evenly across other nodes in the cluster.
• Rebuilding a dead node is faster because it involves every other node in the cluster.
• Improves the use of heterogeneous machines in a cluster. You can assign a proportional number of
vnodes to smaller and larger machines.
For more information, see the article Virtual nodes in Cassandra 1.2.
To convert an existing cluster to vnodes, see Enabling virtual nodes on an existing production cluster.
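Vnodes are enabled per node with the num_tokens property in cassandra.yaml; the value below is a commonly used example, not a requirement.

# cassandra.yaml
num_tokens: 256    # each node owns 256 small token ranges
# initial_token:   # leave initial_token unset when using vnodes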
The top portion of the graphic shows a cluster without vnodes. In this paradigm, each node is assigned
a single token that represents a location in the ring. Each node stores data determined by mapping the
partition key to a token value within a range from the previous node to its assigned value. Each node also
contains copies of each row from other nodes in the cluster. For example, range E replicates to nodes 5, 6,
and 1. Notice that a node owns exactly one contiguous partition range in the ring space.
The bottom portion of the graphic shows a ring with vnodes. Within a cluster, virtual nodes are randomly
selected and non-contiguous. The placement of a row is determined by the hash of the partition key within
many smaller partition ranges belonging to each node.
Data replication
Cassandra stores replicas on multiple nodes to ensure reliability and fault tolerance. A replication strategy
determines the nodes where replicas are placed.
The total number of replicas across the cluster is referred
to as the replication factor. A replication factor of 1 means that there is only one copy of each row on one
node. A replication factor of 2 means two copies of each row, where each copy is on a different node. All
replicas are equally important; there is no primary or master replica. As a general rule, the replication factor
should not exceed the number of nodes in the cluster. However, you can increase the replication factor and
then add the desired number of nodes later.
Two replication strategies are available:
• SimpleStrategy: Use for a single data center only. If you ever intend to use more than one data center, use
the NetworkTopologyStrategy.
• NetworkTopologyStrategy: Highly recommended for most deployments because it is much easier
to expand to multiple data centers when required by future expansion.
SimpleStrategy
Use only for a single data center. SimpleStrategy places the first replica on a node determined by
the partitioner. Additional replicas are placed on the next nodes clockwise in the ring without considering
topology (rack or data center location).
NetworkTopologyStrategy
Use NetworkTopologyStrategy when you have (or plan to have) your cluster deployed across multiple
data centers. This strategy specifies how many replicas you want in each data center.
NetworkTopologyStrategy places replicas in the same data center by walking the ring clockwise until
reaching the first node in another rack. NetworkTopologyStrategy attempts to place replicas on distinct
racks because nodes in the same rack (or similar physical grouping) often fail at the same time due to
power, cooling, or network issues.
When deciding how many replicas to configure in each data center, the two primary considerations are (1)
being able to satisfy reads locally, without incurring cross data-center latency, and (2) failure scenarios. The
two most common ways to configure multiple data center clusters are:
• Two replicas in each data center: This configuration tolerates the failure of a single node per replication
group and still allows local reads at a consistency level of ONE.
• Three replicas in each data center: This configuration tolerates either the failure of one node per
replication group at a strong consistency level of LOCAL_QUORUM or multiple node failures per data
center using consistency level ONE.
Asymmetrical replication groupings are also possible. For example, you can have three replicas in one data
center to serve real-time application requests and use a single replica elsewhere for running analytics.
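For reference, the strategy and per-data-center replica counts are set when a keyspace is created or altered. The keyspace and data center names below are placeholders; data center names must match those reported by the snitch.

CREATE KEYSPACE app_data
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

-- Single data center or development use only:
CREATE KEYSPACE dev_data
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};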
Partitioners
A partitioner determines how data is distributed across the nodes in the cluster (including replicas).
Basically, a partitioner is a function for deriving a token representing a row from its partition key, typically by
hashing. Each row of data is then distributed across the cluster by the value of the token.
Both the Murmur3Partitioner and RandomPartitioner use tokens to help assign equal portions
of data to each node and evenly distribute data from all the tables throughout the ring or other grouping,
such as a keyspace. This is true even if the tables use different partition keys, such as usernames or
timestamps. Moreover, the read and write requests to the cluster are also evenly distributed and load
balancing is simplified because each part of the hash range receives an equal number of rows on average.
For more detailed information, see Consistent hashing.
Cassandra offers the following partitioners:
• Murmur3Partitioner (default): uniformly distributes data across the cluster based on MurmurHash
hash values.
• RandomPartitioner: uniformly distributes data across the cluster based on MD5 hash values.
• ByteOrderedPartitioner: keeps an ordered distribution of data lexically by key bytes.
The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right
choice for new clusters in almost all cases.
Set the partitioner in the cassandra.yaml file:
• Murmur3Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
• RandomPartitioner: org.apache.cassandra.dht.RandomPartitioner
• ByteOrderedPartitioner: org.apache.cassandra.dht.ByteOrderedPartitioner
Note: If using virtual nodes (vnodes), you do not need to calculate the tokens. If not using vnodes,
you must calculate the tokens to assign to the initial_token parameter in the cassandra.yaml file.
See Generating tokens and use the method for the type of partitioner you are using.
The location of the cassandra.yaml file depends on the type of installation.
Murmur3Partitioner
The Murmur3Partitioner provides faster hashing and improved performance than the previous default
partitioner (RandomPartitioner). You can only use Murmur3Partitioner for new clusters; you cannot
change the partitioner in existing clusters. The Murmur3Partitioner uses the MurmurHash function.
This hashing function creates a 64-bit hash value of the partition key. The possible range of hash values is
from -2^63 to +2^63-1.
When using the Murmur3Partitioner, you can page through all rows using the token function in a CQL
query.
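For example, a query such as the following (table and column names are hypothetical) reads one slice of the token range:

SELECT username, token(username)
FROM demo.users
WHERE token(username) > -9223372036854775808
  AND token(username) <= -4611686018427387903;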
RandomPartitioner
The default partitioner prior to Cassandra 1.2.
The RandomPartitioner was the default partitioner prior to Cassandra 1.2. It is included for backwards
compatibility. You can use it in later versions of Cassandra, even when using virtual nodes (vnodes).
However, if you don't use vnodes, you must calculate the tokens, as described in Generating tokens.
The RandomPartitioner distributes data evenly across the nodes using an MD5 hash value of the row
key. The possible range of hash values is from 0 to 2^127-1.
When using the RandomPartitioner, you can page through all rows using the token function in a CQL
query.
ByteOrderedPartitioner
Cassandra provides this partitioner for ordered partitioning. It is included for backwards compatibility.
Cassandra provides the ByteOrderedPartitioner for ordered partitioning. It is included for backwards
compatibility. This partitioner orders rows lexically by key bytes. You calculate tokens by looking at the
actual values of your partition key data and using a hexadecimal representation of the leading character(s)
in a key. For example, if you wanted to partition rows alphabetically, you could assign an A token using its
hexadecimal representation of 41.
Using the ordered partitioner allows ordered scans by primary key. This means you can scan rows as
though you were moving a cursor through a traditional index. For example, if your application has user
names as the partition key, you can scan rows for users whose names fall between Jake and Joe. This
type of query is not possible using randomly partitioned partition keys because the keys are stored in the
order of their MD5 hash (not sequentially).
Although having the capability to do range scans on rows sounds like a desirable feature of ordered
partitioners, there are ways to achieve the same functionality using table indexes.
Using an ordered partitioner is not recommended for the following reasons:
Difficult load balancing
More administrative overhead is required to load balance the cluster. An ordered partitioner requires
administrators to manually calculate partition ranges based on their estimates of the partition key distribution.
In practice, this requires actively moving node tokens around to accommodate the actual distribution of data
once it is loaded.
Sequential writes can cause hot spots
If your application tends to write or update a sequential block of rows at a time, then the writes are not
distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing
with timestamped data.
Uneven load balancing for multiple tables
If your application has multiple tables, chances are that those tables have different row keys and different
distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven
distribution for another table in the same cluster.
Snitches
A snitch determines which data centers and racks nodes belong to.
Snitches inform Cassandra about the network topology so that requests are routed efficiently and allow Cassandra to distribute replicas by grouping machines into data centers and racks. Specifically, the replication strategy places the replicas based on the information provided by the snitch. All snitches on all nodes must report the same rack and data center for any given node. Cassandra does its best not to have more than one replica on the same rack (which is not necessarily a physical location).
Note: If you change snitches, you may need to perform additional steps because the snitch affects
where replicas are placed. See Switching snitches.
Dynamic snitching
Monitors the performance of reads from the various replicas and chooses the best replica based on this
history.
By default, all snitches also use a dynamic snitch layer that monitors read latency and, when possible,
routes requests away from poorly-performing nodes. The dynamic snitch is enabled by default and is
recommended for use in most deployments. For information on how this works, see Dynamic snitching
in Cassandra: past, present, and future. Configure dynamic snitch thresholds for each node in the
cassandra.yaml configuration file.
For more information, see the properties listed under Failure detection and recovery.
The location of the cassandra.yaml file depends on the type of installation.
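The relevant cassandra.yaml properties and their usual default values look like the following (shown for illustration):

dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1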
SimpleSnitch
The SimpleSnitch is used only for single-data center deployments.
The SimpleSnitch (default) does not recognize data center or rack information, so it is suitable only for single-data center deployments or single-zone deployments in public clouds. It treats strategy order as proximity, which can improve cache locality when read repair is disabled.
Using a SimpleSnitch, you define the keyspace to use SimpleStrategy and specify a replication factor.
RackInferringSnitch
Determines the location of nodes by rack and data center corresponding to the IP addresses.
The RackInferringSnitch determines the proximity of nodes by rack and data center, which are assumed to
correspond to the 3rd and 2nd octet of the node's IP address, respectively. This snitch is best used as an
example for writing a custom snitch class (unless this happens to match your deployment conventions).
PropertyFileSnitch
Determines the location of nodes by rack and data center.
Procedure
If you had non-uniform IPs and two physical data centers with two racks in each, and a third logical data
center for replicating analytics data, the cassandra-topology.properties file might look like this:
Note: Data center and rack names are case-sensitive.
175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
175.54.35.197=DC1:RAC1
120.53.24.101=DC1:RAC2
120.55.16.200=DC1:RAC2
120.57.102.103=DC1:RAC2
110.56.12.120=DC2:RAC1
110.50.13.201=DC2:RAC1
110.54.35.184=DC2:RAC1
50.33.23.120=DC2:RAC2
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC2
172.106.12.120=DC3:RAC1
172.106.12.121=DC3:RAC1
172.106.12.122=DC3:RAC1
GossipingPropertyFileSnitch
Automatically updates all nodes using gossip when adding new nodes and is recommended for production.
This snitch is recommended for production. It uses rack and data center information for the local node
defined in the cassandra-rackdc.properties file and propagates this information to other nodes via
gossip.
The cassandra-rackdc.properties file defines the default data center and rack used by this snitch:
Note: Data center and rack names are case-sensitive.
dc=DC1
rack=RAC1
To save bandwidth, add the prefer_local=true option. This option tells Cassandra to use the local IP
address when communication is not across different data centers.
To allow migration from the PropertyFileSnitch, the GossipingPropertyFileSnitch uses the cassandra-
topology.properties file when present.
The location of the cassandra-rackdc.properties file depends on the type of installation:
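For example, a cassandra-rackdc.properties file that also sets prefer_local might contain (values are examples only):

dc=DC1
rack=RAC1
prefer_local=true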
Ec2Snitch
Use the Ec2Snitch with Amazon EC2 in a single region.
Use the Ec2Snitch for simple cluster deployments on Amazon EC2 where all nodes in the cluster are within
a single region.
In EC2 deployments, the region name is treated as the data center name and availability zones are
treated as racks within a data center. For example, if a node is in the us-east-1 region, us-east is the data
center name and 1 is the rack location. (Racks are important for distributing replicas, but not for data center
naming.) Because private IPs are used, this snitch does not work across multiple regions.
If you are using only a single data center, you do not need to specify any properties.
If you need multiple data centers, set the dc_suffix options in the cassandra-rackdc.properties file. Any
other lines are ignored.
For example, for each node within the us-east region, specify the data center in its cassandra-
rackdc.properties file:
Note: Data center names are case-sensitive.
• node0
dc_suffix=_1_cassandra
• node1
dc_suffix=_1_cassandra
• node2
dc_suffix=_1_cassandra
• node3
dc_suffix=_1_cassandra
• node4
dc_suffix=_1_analytics
• node5
dc_suffix=_1_search
This results in three data centers for the region:
us-east_1_cassandra
us-east_1_analytics
us-east_1_search
Note: The data center naming convention in this example is based on the workload. You can use
other conventions, such as DC1, DC2 or 100, 200.
Ec2MultiRegionSnitch
Use the Ec2MultiRegionSnitch for deployments on Amazon EC2 where the cluster spans multiple regions.
You must configure settings in both the cassandra.yaml file and the property file (cassandra-
rackdc.properties) used by the Ec2MultiRegionSnitch.
To find the public IP address, from each of the seed nodes in EC2:
$ curl http://instance-data/latest/meta-data/public-ipv4
Note: Do not make all nodes seeds, see Internode communications (gossip).
3. Be sure that the storage_port or ssl_storage_port is open on the public IP firewall.
For example, a cluster spanning the us-east and us-west regions with dc_suffix=_1_cassandra on each node results in two data centers: us-east_1_cassandra and us-west_1_cassandra.
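A sketch of the cassandra.yaml settings typically involved with this snitch, using example IP addresses; adapt them to your deployment:

endpoint_snitch: Ec2MultiRegionSnitch
listen_address: 172.31.5.10          # private IP of this node
broadcast_address: 54.210.33.100     # public IP from the metadata query above
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "54.210.33.100,52.8.44.250"    # public IPs of the seed nodes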
GoogleCloudSnitch
Use the GoogleCloudSnitch for Cassandra deployments on Google Cloud Platform across one or more
regions.
The region name is treated as the data center name and zones are treated as racks within the data center. All communication occurs over private IP addresses within the same logical network.
For example, if a node is in the us-central1-a region, us-central1 is the data center name and a is the rack
location. (Racks are important for distributing replicas, but not for data center naming.) This snitch can
work across multiple regions without additional configuration.
If you are using only a single data center, you do not need to specify any properties.
If you need multiple data centers, set the dc_suffix options in the cassandra-rackdc.properties file. Any
other lines are ignored.
For example, for each node within the us-central1 region, specify the data center in its cassandra-
rackdc.properties file:
Note: Data center names are case-sensitive.
• node0
dc_suffix=_a_cassandra
• node1
dc_suffix=_a_cassandra
• node2
dc_suffix=_a_cassandra
• node3
dc_suffix=_a_cassandra
• node4
dc_suffix=_a_analytics
• node5
dc_suffix=_a_search
Note: Data center and rack names are case-sensitive.
CloudstackSnitch
Use the CloudstackSnitch for Apache Cloudstack environments.
Because zone naming is free-form in
Apache Cloudstack, this snitch uses the widely-used <country> <location> <az> notation.
Client requests
Client read or write requests can be sent to any node in the cluster because all nodes in Cassandra are
peers.
When a client connects to a node and issues a read or write request, that node serves as the
coordinator for that particular client operation.
The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas)
that own the data being requested. The coordinator determines which nodes in the ring should get the
request based on the cluster's configured partitioner and replica placement strategy.
Planning a cluster deployment
Memory
The more memory a Cassandra node has, the better the read performance. More RAM also allows memory tables (memtables) to hold more recently written data. Larger memtables lead to fewer SSTables being flushed to disk and fewer files to scan during a read. The ideal amount of RAM depends
on the anticipated size of your hot data.
For both dedicated hardware and virtual environments:
• Production: 16GB to 64GB; the minimum is 8GB.
• Development in non-loading testing environments: no less than 4GB.
• For setting Java heap space, see Tuning Java resources.
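Heap settings are typically made in cassandra-env.sh; as a rough, illustrative sketch for a node with ample RAM (see Tuning Java resources for the actual guidance):

MAX_HEAP_SIZE="8G"     # total JVM heap for the Cassandra process
HEAP_NEWSIZE="800M"    # young generation size, roughly 100MB per physical CPU core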
CPU
Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound. (All writes go to
the commit log, but Cassandra is so efficient in writing that the CPU is the limiting factor.) Cassandra is
highly concurrent and uses as many CPU cores as available:
• Production environments:
• For dedicated hardware, 8-core CPU processors are the current price-performance sweet spot.
• For virtual environments, 4 to 8-core CPU processors.
• Development in non-loading testing environments:
• For dedicated hardware, 2-core CPU processors.
• For virtual environments, 2-core CPU processors.
plan for drive failures and have spares available. A large variety of SSDs are available on the market from
server vendors and third-party drive manufacturers.
For purchasing SSDs, the best recommendation is to make SSD endurance decisions not based on
workload, but on how difficult it is to change drives when they fail. Remember, your data is protected
because Cassandra replicates data across the cluster. Buying strategies include:
• If drives are quickly available, buy the cheapest drives that provide the performance you want.
• If it is more challenging to swap the drives, consider higher endurance models, possibly starting in the
mid range, and then choose replacements of higher or lower endurance based on the failure rates of
the initial model chosen.
• Always buy cheap SSDs and keep several spares online and unused in the servers until the initial
drives fail. This way you can replace the drives without touching the server.
DataStax customers that need help in determining the most cost-effective option for a given deployment and workload should contact their Solutions Engineer or Architect.
Disk space
Disk space depends on usage, so it's important to understand the mechanism. Cassandra writes data
to disk when appending data to the commit log for durability and when flushing memtable to SSTable
data files for persistent storage. The commit log has a different access pattern (read/writes ratio) than the
pattern for accessing data from SSTables. This is more important for spinning disks than for SSDs (solid
state drives).
SSTables are periodically compacted. Compaction improves performance by merging and rewriting data and discarding old data. However, depending on the type of compaction strategy and the size of the compactions, disk utilization and data directory volume temporarily increase during compaction. For this reason, leave an adequate amount of free disk space available on a node: 50% (worst case) for SizeTieredCompactionStrategy and DateTieredCompactionStrategy, and 10% for LeveledCompactionStrategy. For more information about compaction, see:
• Compaction
• The Apache Cassandra storage engine
• Leveled Compaction in Apache Cassandra
• When to Use Leveled Compaction
• DateTieredCompactionStrategy: Compaction for Time Series Data and Getting Started with Time
Series Data Modeling
For information on calculating disk size, see Calculating usable disk capacity.
Recommendations:
Capacity per node
Most workloads work best with a capacity under 500GB to 1TB per node depending on I/O. Maximum
recommended capacity for Cassandra 1.2 and later is 3 to 5TB per node for uncompressed data. For
Cassandra 1.1, it is 500 to 800GB per node. Be sure to account for replication.
Capacity and I/O
When choosing disks, consider both capacity (how much data you plan to store) and I/O (the write/read
throughput rate). Some workloads are best served by using less expensive SATA disks and scaling disk
capacity and I/O by adding more nodes (with more RAM).
Number of disks - SATA
Ideally Cassandra needs at least two disks, one for the commit log and the other for the data directories.
At a minimum the commit log should be on its own partition.
Commit log disk - SATA
The disk need not be large, but it should be fast enough to receive all of your writes as appends (sequential
I/O).
Number of nodes
Prior to version 1.2, the recommended size of disk space per node was 300 to 500GB. Improvements in Cassandra 1.2, such as JBOD support, virtual nodes (vnodes), off-heap Bloom filters, and parallel leveled compaction (SSD nodes only), allow you to use fewer machines, each with multiple terabytes of disk space.
Network
Since Cassandra is a distributed data store, it puts load on the network to handle read/write requests and
replication of data across nodes. Be sure that your network can handle traffic between nodes without
bottlenecks. You should bind your interfaces to separate Network Interface Cards (NICs). You can use public or private interfaces depending on your requirements.
• Recommended bandwidth is 1000 Mbit/s (gigabit) or greater.
• Thrift/native protocols use the rpc_address.
• Cassandra's internal storage protocol uses the listen_address.
Cassandra efficiently routes requests to replicas that are geographically closest to the coordinator node
and chooses a replica in the same rack if possible; it always chooses replicas located in the same data
center over replicas in a remote data center.
Firewall
If using a firewall, make sure that nodes within a cluster can reach each other. See Configuring firewall port
access.
• Adding capacity by increasing the number of EBS volumes per host does not scale. You can easily
surpass the ability of the system to keep effective buffer caches and concurrently serve requests for all
of the data it is responsible for managing.
Note: Use only ephemeral instance-store devices for Cassandra data storage.
For more information and graphs related to ephemeral versus EBS performance, see the blog article
Systematic Look at EC2 I/O.
Calculating usable disk capacity
Procedure
1. Start with the raw capacity of the physical disks:
During normal operations, Cassandra routinely requires disk capacity for compaction and repair
operations. For optimal performance and cluster health, DataStax recommends not filling your disks
to capacity, but running at 50% to 80% capacity depending on the compaction strategy and size of the
compactions.
3. Calculate the usable disk space accounting for file system formatting overhead (roughly 10 percent):
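Putting these steps together, a rough calculation might look like the following; the disk sizes and the 50% utilization factor are examples consistent with the guidance above.

$ raw_capacity=$(( 2 * 1000 ))             # for example, 2 data disks of 1000 GB each
$ formatted=$(( raw_capacity * 9 / 10 ))   # subtract roughly 10% file system formatting overhead
$ usable=$(( formatted / 2 ))              # run at 50% (worst case) to 80% of capacity
$ echo "plan for roughly ${usable} GB of data per node"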
Calculating user data size
Procedure
• Determine column overhead:
Every column in Cassandra incurs 15 bytes of overhead. Since each row in a table can have different
column names as well as differing numbers of columns, metadata is stored for each column. For
counter columns and expiring columns, you should add an additional 8 bytes (23 bytes total).
• Account for row overhead.
Every row in Cassandra incurs 23 bytes of overhead.
• Estimate primary key index size.
• Determine replication overhead:
The replication factor plays a role in how much disk capacity is used. For a replication factor of 1, there is no overhead for replicas (as only one copy of data is stored in the cluster). If the replication factor is greater than 1, then your total data storage requirement will include replication overhead.
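Combining these overheads, a rough, illustrative estimate can be scripted as follows; the row counts and sizes are invented, and the per-row primary key index entry size (32 bytes plus the key) is an assumption.

$ rows=1000000; cols_per_row=10; col_name_size=10; col_value_size=50; key_size=20; rf=3
$ column_size=$(( col_name_size + col_value_size + 15 ))   # 15 bytes of column overhead
$ row_size=$(( cols_per_row * column_size + 23 ))          # plus 23 bytes of row overhead
$ index_size=$(( rows * (32 + key_size) ))                 # assumed primary key index entry per row
$ total=$(( (rows * row_size + index_size) * rf ))         # replication multiplies the cluster-wide total
$ echo "rough cluster-wide estimate before compression: ${total} bytes"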
Anti-patterns in Cassandra
Implementation or design patterns that are ineffective and/or counterproductive in Cassandra production
installations. Correct patterns are suggested in most cases.
Load balancers
Cassandra was designed to avoid the need for load balancers. Putting load balancers between Cassandra
and Cassandra clients is harmful to performance, cost, availability, debugging, testing, and scaling. All
high-level clients, such as the Java and Python drivers for Cassandra, implement load balancing directly.
Insufficient testing
Be sure to test at scale and with production loads. This is the best way to ensure your system will function
properly when your application goes live. The information you gather from testing is the best indicator of
what throughput per node is needed for future expansion calculations.
To properly test, set up a small cluster and run it with production loads. Each node count has a maximum throughput beyond which performance no longer increases. Take the maximum throughput at this cluster size, apply it linearly to clusters of other sizes, and extrapolate (graph) the results to predict the cluster sizes required for the throughputs your production cluster needs now and in the future.
The Netflix case study shows an excellent example for testing.
More anti-patterns
For more about anti-patterns, see Matt Dennis' slideshare presentation.
Installing DataStax Community
Installing on RHEL-based systems
Procedure
In a terminal window:
1. Check which version of Java is installed by running the following command:
$ java -version
It is recommended to use the latest version of Oracle Java 8 on all nodes. (Oracle Java 7 is also
supported.)
See Installing the JRE on RHEL-based systems.
2. Add the DataStax Community repository to the /etc/yum.repos.d/datastax.repo:
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
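To install from this repository, the command typically looks like the following, assuming the DataStax Community 2.1 package name dsc21:

$ sudo yum install dsc21    # installs Cassandra and the DataStax Community tools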
What to do next
• Initializing a multiple node cluster (single data center)
• Initializing a multiple node cluster (multiple data centers)
• Recommended production settings
• Installing OpsCenter
• Key components for configuring Cassandra
• Starting Cassandra as a service
• Package installation directories
Procedure
In a terminal window:
1. Check which version of Java is installed by running the following command:
$ java -version
It is recommended to use the latest version of Oracle Java 8 on all nodes. (Oracle Java 7 is also
supported.)
See Installing the JRE on RHEL-based systems.
2. Add the DataStax Community repository to the /etc/apt/sources.list.d/cassandra.sources.list file:
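For example, a line similar to the following (repository URL as published by DataStax for Cassandra 2.1) can be appended to that file:
$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list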
3. Debian systems only:
a) In /etc/apt/sources.list, add contrib non-free to the end of the line that describes your Debian source
repository. This allows installation of the Oracle JVM instead of the OpenJDK JVM.
b) Save and close the file when you are done adding/editing your sources.
4. Add the DataStax repository key to your aptitude trusted keys.
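For example (key URL as published by DataStax), then install the packages:
$ curl -L https://debian.datastax.com/debian/repo_key | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install dsc21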
What to do next
• Initializing a multiple node cluster (single data center)
• Initializing a multiple node cluster (multiple data centers)
• Recommended production settings
• Installing OpsCenter
• Key components for configuring Cassandra
• Starting Cassandra as a service
• Package installation directories
Procedure
In a terminal window:
1. Check which version of Java is installed by running the following command:
$ java -version
It is recommended to use the latest version of Oracle Java 8 on all nodes. (Oracle Java 7 is also
supported.)
See Installing the JRE on RHEL-based systems.
2. Download the DataStax Community:
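For example, a typical download command (substitute an exact 2.1.x version; see the download URLs later in this section):
$ curl -L http://downloads.datastax.com/community/dsc-cassandra-2.1.x-bin.tar.gz | tar xz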
$ cd dsc-cassandra-2.1.x
The DataStax Community distribution of Cassandra is ready for configuration.
4. Go to the install directory:
$ cd install_location/apache-cassandra-2.1.x
Cassandra is ready for configuration.
What to do next
• Initializing a multiple node cluster (single data center)
• Initializing a multiple node cluster (multiple data centers)
• Recommended production settings
• Installing OpsCenter
• Key components for configuring Cassandra
• Tarball installation directories
• Starting Cassandra as a stand-alone process
http://downloads.datastax.com/community/dsc-cassandra-2.1.0-bin.tar.gz
http://downloads.datastax.com/community/dsc-cassandra-2.1.3-bin.tar.gz
2. Unpack the distribution. For example:
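For example, assuming the 2.1.3 tarball downloaded from the URL above:
$ tar -xvzf dsc-cassandra-2.1.3-bin.tar.gz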
$ sudo rm -r /var/lib/cassandra
$ sudo rm -r /var/log/cassandra
5. Remove the installation directories:
RHEL-based packages:
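For example, a typical removal command for RHEL-based packages is:
$ sudo yum remove "cassandra*" "datastax*"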
• Uses RAID0 ephemeral disks for data storage and commit logs.
• Choice of PV (Para-virtualization) or HVM (Hardware-assisted Virtual Machine) instance types. See
Amazon documentation.
• Launches EBS-backed instances for faster start-up, not database storage.
• Uses the private interface for intra-cluster communication.
• Sets the seed nodes cluster-wide.
• Installs OpsCenter (by default).
Note: When creating an EC2 cluster that spans multiple regions and availability zones, use
OpsCenter to set up your cluster. See EC2 clusters spanning multiple regions and availability
zones.
Because Amazon changes the EC2 console intermittently, these instructions have been generalized. For
details on each step, see the User guide in the Amazon Elastic Compute Cloud Documentation.
To install a Cassandra cluster from the DataStax AMI, complete the following tasks:
Procedure
If you need more help, click an informational icon or a link to the Amazon EC2 User Guide.
1. Sign in to the AWS console.
2. From the Amazon EC2 console navigation bar, select the same region as where you will launch the
DataStax Community AMI.
Step 1 in Launch instances provides a list of the available regions.
Table 1: Ports
Warning: The security configuration shown in this example opens up all externally accessible
ports to incoming traffic from any IP address (0.0.0.0/0). The risk of data loss is high. If you
desire a more secure configuration, see the Amazon EC2 help on security groups.
Procedure
You must create a key pair for each region you use.
1. From the Amazon EC2 console navigation bar, select the same region as where you will launch the
DataStax Community AMI.
Step 1 in Launch instances provides a list of the available regions.
Procedure
If you need more help, click an informational icon or a link to the Amazon EC2 User Guide.
Region AMI
HVM instances (Hardware-assisted Virtual Machine - see Amazon documentation.)
us-east-1 ami-ada2b6c4
us-west-1 ami-3cf7c979
us-west-2 ami-1cff962c
eu-west-1 ami-7f33cd08
ap-southeast-1 ami-b47828e6
ap-southeast-2 ami-55d54d6f
ap-northeast-1 ami-714a3770
sa-east-1 ami-1dda7800
PV instances (Paravirtualization - see Amazon documentation.)
us-east-1 ami-f9a2b690
us-west-1 ami-32f7c977
us-west-2 ami-16ff9626
eu-west-1 ami-8932ccfe
ap-southeast-1 ami-8c7828de
ap-southeast-2 ami-57d54d6d
ap-northeast-1 ami-6b4a376a
sa-east-1 ami-15da7808
2. In Step 2: Choose an Instance Type, choose the appropriate type.
The recommended instances are:
• Development and light production: m3.large
• Moderate production: m3.xlarge
• SSD production with light data: c3.2xlarge
• Largest heavy production: m3.2xlarge (PV) or i2.2xlarge (HVM)
• Micro, small, and medium types are not supported.
Note: The main difference between m1 and m3 instance types for use with Cassandra is that
m3 instance types have faster, smaller SSD drives and m1 instance types have slower, larger
rotational drives. Use m1 instance types when you have a higher tolerance for latency SLAs, when you
require smaller cluster sizes, or both. For more aggressive workloads, use m3 instance types with
appropriately sized clusters.
When the instance is selected, its specifications are displayed:
Because Amazon updates instance types periodically, see the following docs to help you determine
your hardware and storage requirements:
• Planning a cluster deployment
• User guide in the Amazon Elastic Compute Cloud Documentation
• What is the story with AWS storage
• Get in the Ring with Cassandra and EC2
3. Click Next: Configure Instance Details and configure the instances to suit your requirements:
• Number of instances
• Network - Select Launch into EC2-Classic.
• Advanced Details - Open and add the following options (as text) to the User Data section.
Option                 Description
--clustername name     Required. The name of the cluster.
--totalnodes #_nodes   Required. The total number of nodes in the cluster.
--version              Required. The version of the cluster. Use community to install the latest community version of DataStax Community.
--opscenter [no]       Optional. By default, DataStax OpsCenter is installed on the first instance. Specify no to disable.
--reflector url        Optional. Allows you to use your own reflector. Default: http://reflector2.datastax.com/reflector2.php
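For example, a typical User Data entry (the cluster name and node count are illustrative):
--clustername myDSCcluster --totalnodes 6 --version community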
• If you need to create a new key pair, click the Choose an existing key pair drop-down list and select
Create a new key pair. Then create the new key pair as described in Create key pair.
9. Click Launch Instances.
The AMI image configures your cluster and starts Cassandra services. The Launch Status page is
displayed.
10. Click View Instances.
Procedure
1. If necessary, from the EC2 Dashboard, click Running Instances.
You can connect to any node in the cluster. However, one node (Node0) runs OpsCenter and is the
Cassandra seed node.
The AMI image configures your cluster and starts the Cassandra services.
6. After you have logged into a node and the AMI has completed installing and setting up the nodes, the
status is displayed:
The URL for the OpsCenter is displayed when you connect to the node containing it; otherwise it is not
displayed.
7. If you installed OpsCenter, allow 60 to 90 seconds after the cluster has finished
initializing for OpsCenter to start. You can launch OpsCenter using the URL:
http://public_dns_of_first_instance:8888/
The Dashboard should show that the agents are connected.
b) When prompted for credentials for the agent nodes, use the username ubuntu and copy and paste
the entire contents from your private key (.pem).
Procedure
To clear the data from the default directories:
After stopping the service, run the following command:
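For example, assuming the default data directories:
$ sudo rm -rf /var/lib/cassandra/*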
Procedure
1. Register with GoGrid.
2. Fill out the registration form and complete the account verification.
3. Access the management console with the login credentials you received in your email.
Your cluster automatically starts deploying. A green status indicator shows that a server is up and
running.
Hover over any item to view its details or right-click to display a context menu.
4. Log in to one of the servers and validate that the servers are configured and communicating:
Note: You can log in to any member of the cluster with SSH, a third-party client (like PuTTY),
or through the GoGrid Console service.
a) To find your server credentials, right-click the server and select Passwords.
b) From your secure connection client, log in to the server with the proper credentials. For example, from SSH:
$ ssh root@ip_address
c) Validate that the cluster is running:
$ nodetool status
Each node should be listed and its status and state should be UN (Up Normal):
Datacenter: datacenter1
=======================
Status=Up/Down |/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns (effective)  Host ID                               Rack
UN  10.110.94.2  71.46 KB  256     65.9%             3ed072d6-49cb-4713-bd55-ea4de25576a9  rack1
UN  10.110.94.5  40.91 KB  256     66.7%             d5d982bc-6e30-40a0-8fe7-e46d6622c1d5  rack1
UN  10.110.94.4  73.33 KB  256     67.4%             f6c3bf08-d9e5-43c8-85fa-5420db785052  rack1
What to do next
The following provides information about using and configuring Cassandra, OpsCenter, GoGrid, and the
Cassandra Query Language (CQL):
• About Apache Cassandra
• OpsCenter documentation
• GoGrid documentation
• CQL for Cassandra 2.x
Procedure
1. Check which version of the JRE your system is using:
$ java -version
2. If necessary, go to Oracle Java SE Downloads, accept the license agreement, and download the
installer for your distribution.
Note: If installing the Oracle JRE in a cloud environment, accept the license agreement,
download the installer to your local client, and then use scp (secure copy) to transfer the file to
your cloud machines.
3. From the directory where you downloaded the package, run the install:
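For example, on an RPM-based distribution (the file name varies with the JRE version you downloaded):
$ sudo rpm -ivh jre-8u<version>-linux-x64.rpm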
If you have any problems, set the PATH and JAVA_HOME variables:
export PATH="$PATH:/usr/java/latest/bin"
export JAVA_HOME=/usr/java/latest
5. Make sure your system is now using the correct JRE. For example:
$ java -version
6. If the OpenJDK JRE is still being used, use the alternatives command to switch it. For example:
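A typical invocation (the exact paths shown in the output vary by system) is:
$ sudo alternatives --config java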
Selection Command
------------------------------------------------------------
1 /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java
*+ 2 /usr/java/jre1.8.0_45/bin/java
Procedure
1. Check which version of the JRE your system is using:
$ java -version
2. If necessary, go to Oracle Java SE Downloads, accept the license agreement, and download the
installer for your distribution.
Note: If installing the Oracle JRE in a cloud environment, accept the license agreement,
download the installer to your local client, and then use scp (secure copy) to transfer the file to
your cloud machines.
If updating from a previous version that was removed manually, execute the above command twice,
because you'll get an error message the first time.
6. Set the new JRE as the default:
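For example, the interactive selection command (choose the Oracle JRE entry from the list it prints) is:
$ sudo update-alternatives --config java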
7. Make sure your system is now using the correct JRE. For example:
$ java -version
Optimizing SSDs
For the majority of Linux distributions, SSDs are not configured optimally by default. The following steps
ensure best practice settings for SSDs:
1. Ensure that the SysFS rotational flag is set to false (zero).
This overrides any detection by the operating system to ensure the drive is considered an SSD.
2. Repeat for any block devices created from SSD storage, such as mdarrays.
3. Set the IO scheduler to either deadline or noop:
• The noop scheduler is the right choice when the target block device is an array of SSDs behind a
high-end IO controller that performs IO optimization.
• The deadline scheduler optimizes requests to minimize IO latency. If in doubt, use the deadline
scheduler.
4. Set the read-ahead value for the block device to 8KB.
This setting tells the operating system not to read extra bytes, which can increase IO time and pollute
the cache with bytes that weren’t requested by the user.
For example, if the SSD is /dev/sda, in /etc/rc.local:
echo deadline > /sys/block/sda/queue/scheduler
#OR...
#echo noop > /sys/block/sda/queue/scheduler
echo 0 > /sys/class/block/sda/queue/rotational
echo 8 > /sys/class/block/sda/queue/read_ahead_kb
Tarball installs: Ensure that the following settings are included in the /etc/security/limits.conf file:
* - memlock unlimited
* - nofile 100000
* - nproc 32768
* - as unlimited
If you run Cassandra as root, some Linux distributions, such as Ubuntu, require setting the limits for root
explicitly instead of using *:
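For example (mirroring the settings above):
root - memlock unlimited
root - nofile 100000
root - nproc 32768
root - as unlimited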
For CentOS, RHEL, and OEL systems, also set the nproc limits in /etc/security/limits.d/90-nproc.conf:
* - nproc 32768
In addition, add the following setting to /etc/sysctl.conf:
vm.max_map_count = 131072
To make the changes take effect, reboot the server or run the following command:
$ sudo sysctl -p
To confirm the limits are applied to the Cassandra process, run the following command where pid is the
process ID of the currently running Cassandra process:
$ cat /proc/<pid>/limits
Disable swap
You must disable swap entirely. Failure to do so can severely lower performance. Because Cassandra
has multiple replicas and transparent failover, it is preferable for a replica to be killed immediately when
memory is low rather than go into swap. This allows traffic to be immediately redirected to a functioning
replica instead of continuing to hit the replica that has high latency due to swapping. If your system has a
lot of DRAM, swapping still lowers performance significantly because the OS swaps out executable code
so that more DRAM is available for caching disks.
If you insist on using swap, you can set vm.swappiness=1. This allows the kernel to swap out the absolute
least used parts.
Synchronize clocks
The clocks on all nodes should be synchronized. You can use NTP (Network Time Protocol) or other
methods.
This is required because columns are only overwritten if the timestamp in the new version of the column is
more recent than the existing column.
Initializing a cluster
Topics for deploying a cluster.
Procedure
1. Suppose you install Cassandra on these nodes:
Note: It is a best practice to have more than one seed node per data center.
2. If you have a firewall running in your cluster, you must open certain ports for communication between
the nodes. See Configuring firewall port access.
3. If Cassandra is running, you must stop the server and clear the data:
Doing this removes the default cluster_name (Test Cluster) from the system table. All nodes must use
the same cluster name.
Package installations:
a) Stop Cassandra:
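For example, on a typical package installation (paths may differ in your environment):
$ sudo service cassandra stop
b) Clear the data:
$ sudo rm -rf /var/lib/cassandra/data/system/*
4. Set the properties in the cassandra.yaml file on each node. For example: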
cluster_name: 'MyCassandraCluster'
num_tokens: 256
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "110.82.155.0,110.82.155.3"
listen_address:
rpc_address: 0.0.0.0
endpoint_snitch: GossipingPropertyFileSnitch
5. In the cassandra-rackdc.properties file, assign the data center and rack names you determined
in the Prerequisites. For example:
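For example (the data center and rack names are illustrative):
dc=DC1
rack=RAC1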
6. After you have installed and configured Cassandra on all nodes, start the seed nodes one at a time,
and then start the rest of the nodes.
Note: If the node has restarted because of automatic restart, you must first stop the node and
clear the data directories, as described above.
Package installations:
$ sudo service cassandra start
Tarball installations:
$ cd install_location
$ bin/cassandra
7. To check that the ring is up and running, run:
Package installations:
$ nodetool status
Tarball installations:
$ cd install_location
$ bin/nodetool status
Each node should be listed and its status and state should be UN (Up Normal).
This example describes installing a six node cluster spanning two data centers. Each node is configured to
use the GossipingPropertyFileSnitch (multiple rack aware) and 256 virtual nodes (vnodes).
In Cassandra, the term data center is a grouping of nodes. Data center is synonymous with replication
group, that is, a grouping of nodes configured together for replication purposes.
Procedure
1. Suppose you install Cassandra on these nodes:
Note: It is a best practice to have more than one seed node per data center.
2. If you have a firewall running in your cluster, you must open certain ports for communication between
the nodes. See Configuring firewall port access.
3. If Cassandra is running, you must stop the server and clear the data:
Doing this removes the default cluster_name (Test Cluster) from the system table. All nodes must use
the same cluster name.
Package installations:
a) Stop Cassandra:
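For example, on a typical package installation (paths may differ in your environment):
$ sudo service cassandra stop
b) Clear the data:
$ sudo rm -rf /var/lib/cassandra/data/system/*
4. Set the properties in the cassandra.yaml file on each node. For example: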
cluster_name: 'MyCassandraCluster'
num_tokens: 256
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.168.66.41,10.176.170.59"
listen_address:
endpoint_snitch: GossipingPropertyFileSnitch
5. In the cassandra-rackdc.properties file, assign the data center and rack names you determined in the
Prerequisites. For example, assign nodes 0 to 2 to data center DC1 and nodes 3 to 5 to data center DC2.
6. After you have installed and configured Cassandra on all nodes, start the seed nodes one at a time,
and then start the rest of the nodes.
Note: If the node has restarted because of automatic restart, you must first stop the node and
clear the data directories, as described above.
Package installations:
$ sudo service cassandra start
Tarball installations:
$ cd install_location
$ bin/cassandra
7. To check that the ring is up and running, run:
Package installations:
$ nodetool status
Tarball installations:
$ cd install_location
$ bin/nodetool status
Each node should be listed and its status and state should be UN (Up Normal).
Security
Topics for securing Cassandra.
Securing Cassandra
Cassandra provides various security features to the open source community.
Cassandra provides these security features to the open source community.
• Client-to-node encryption
Cassandra includes an optional, secure form of communication from a client machine to a database
cluster. Client-to-server SSL ensures that data in flight is not compromised and is securely transferred
back and forth between client machines and the cluster.
• Authentication based on internally controlled login accounts/passwords
Administrators can create users who can be authenticated to Cassandra database clusters using the
CREATE USER command. Internally, Cassandra manages user accounts and access to the database
cluster using passwords. User accounts may be altered and dropped using the Cassandra Query
Language (CQL).
• Object permission management
Once a user is authenticated to a database cluster using internal authentication, the next security issue
to be tackled is permission management. What can the user do inside the database? Authorization
capabilities for Cassandra use the familiar GRANT/REVOKE security paradigm to manage object
permissions.
SSL encryption
Topics for using SSL in Cassandra.
Client-to-node encryption
Client-to-node encryption protects data in flight from client machines to a database cluster using SSL
(Secure Sockets Layer).
Procedure
On each node under client_encryption_options:
• Enable encryption.
• Set the appropriate paths to your .keystore and .truststore files.
• Provide the required passwords. The passwords must match the passwords used when generating the
keystore and truststore.
• To enable client certificate authentication, set require_client_auth to true. (Available starting with
Cassandra 1.2.3.)
Example
client_encryption_options:
    enabled: true
    keystore: conf/.keystore  ## The path to your .keystore file
    keystore_password: <keystore password>  ## The password you used when generating the keystore
    truststore: conf/.truststore
    truststore_password: <truststore password>
    require_client_auth: <true or false>
Node-to-node encryption
Node-to-node encryption protects data transferred between nodes in a cluster, including gossip
communications, using SSL (Secure Sockets Layer).
Procedure
On each node under server_encryption_options:
• Enable internode_encryption.
The available options are:
• all
• none
• dc: Cassandra encrypts the traffic between the data centers.
• rack: Cassandra encrypts the traffic between the racks.
• Set the appropriate paths to your .keystore and .truststore files.
• Provide the required passwords. The passwords must match the passwords used when generating the
keystore and truststore.
• To enable client certificate authentication, set require_client_auth to true. (Available starting with
Cassandra 1.2.3.)
Example
server_encryption_options:
    internode_encryption: <internode_option>
    keystore: resources/dse/conf/.keystore
    keystore_password: <keystore password>
    truststore: resources/dse/conf/.truststore
    truststore_password: <truststore password>
    require_client_auth: <true or false>
Procedure
1. To run cqlsh with SSL encryption, create a .cassandra/cqlshrc file in your home or client program
directory.
Sample files are available in the following directories:
• Package installations: /etc/cassandra
• Tarball installations: install_location/conf
2. Start cqlsh with the --ssl option.
Example
[authentication]
username = fred
password = !!bang!!$
[connection]
hostname = 127.0.0.1
port = 9042
[ssl]
certfile = ~/keys/cassandra.cert
validate = true ## Optional, true by default
userkey = ~/key.pem ## Provide when require_client_auth=true
usercert = ~/cert.pem ## Provide when require_client_auth=true
Note: When validate is enabled, the host in the certificate is compared to the host of the machine
that it is connected to. The SSL certificate must be provided either in the configuration file or as an
environment variable. The environment variables (SSL_CERTFILE and SSL_VALIDATE) override
any options set in this file.
Procedure
1. Generate the private and public key pair for the nodes of the cluster.
A prompt for the new keystore and key password appears.
2. Leave key password the same as the keystore password.
3. Repeat steps 1 and 2 on each node using a different alias for each one.
4. Export the public part of the certificate to a separate file and copy these certificates to all other nodes.
5. Add the certificate of each node to the truststore of each node, so nodes can verify the identity of other
nodes.
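For example, a minimal keytool sequence for one node (aliases and file names are illustrative):
$ keytool -genkey -keyalg RSA -alias node0 -keystore .keystore
$ keytool -export -alias node0 -file node0.cer -keystore .keystore
$ keytool -import -v -trustcacerts -alias node0 -file node0.cer -keystore .truststore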
Procedure
1. Generate the certificate as described in Client-to-node encryption.
2. Import the user's certificate into every node's truststore using keytool:
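For example (the alias and file name are illustrative):
$ keytool -import -v -trustcacerts -alias <username> -file <certificate file> -keystore .truststore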
Internal authentication
Topics for internal authentication.
Internal authentication
Internal authentication is based on Cassandra-controlled login accounts and passwords.
Like object permission management using internal authorization, internal authentication is based on
Cassandra-controlled login accounts and passwords. Internal authentication works for the following clients
when you provide a user name and password to start up the client:
• Astyanax
• cassandra-cli
• cqlsh
• DataStax drivers - produced and certified by DataStax to work with Cassandra.
• Hector
• pycassa
Internal authentication stores usernames and bcrypt-hashed passwords in the system_auth.credentials
table.
PasswordAuthenticator is an IAuthenticator implementation that you can use to configure Cassandra for
internal authentication out-of-the-box.
Configuring authentication
Steps for configuring authentication.
Procedure
1. Change the authenticator option in the cassandra.yaml file to PasswordAuthenticator.
authenticator: PasswordAuthenticator
2. Increase the replication factor for the system_auth keyspace to N (number of nodes).
If you use the default, 1, and the node with the lone replica goes down, you will not be able to log into
the cluster because the system_auth keyspace was not replicated.
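For example, for a cluster with a single data center named DC1 and three nodes (the strategy class and data center name are illustrative; match your snitch and topology):
ALTER KEYSPACE system_auth WITH REPLICATION =
  { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3 };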
3. Restart the Cassandra client.
The default superuser name and password that you use to start the client is stored in Cassandra.
5. Create another superuser, not named cassandra. This step is optional but highly recommended.
6. Log in as that new superuser.
7. Change the cassandra user password to something long and incomprehensible, and then forget about
it. It won't be used again.
8. Take away the cassandra user's superuser status.
9. Use the CQL statements listed previously to set up user accounts and then grant permissions to access
the database objects.
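For example, a typical sequence after restarting (the user names and passwords are illustrative):
$ cqlsh -u cassandra -p cassandra
CREATE USER admin WITH PASSWORD 'long-random-password' SUPERUSER;
-- log in again as the new superuser, then:
ALTER USER cassandra WITH PASSWORD 'another-long-random-password' NOSUPERUSER;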
Procedure
1. Open a text editor and create a file that specifies a user name and password.
[authentication]
username = fred
password = !!bang!!$
Internal authorization
Topics about internal authorization.
Object permissions
Granting or revoking permissions to access Cassandra data.
Cassandra provides the familiar relational database GRANT/REVOKE paradigm to grant or revoke
permissions to access Cassandra data. A superuser grants initial permissions, and subsequently a user
may or may not be given the permission to grant/revoke permissions. Object permission management is
based on internal authorization.
Read access to these system tables is implicitly given to every authenticated user because the tables are
used by most Cassandra tools:
• system.schema_keyspace
• system.schema_columns
• system.schema_columnfamilies
• system.local
• system.peers
Procedure
1. In the cassandra.yaml file, comment out the default AllowAllAuthorizer and add the
CassandraAuthorizer.
authorizer: CassandraAuthorizer
Results
CQL supports these authorization statements:
• GRANT
• LIST PERMISSIONS
• REVOKE
Public ports
Port number  Description
22           SSH port.
8888         OpsCenter website. The opscenterd daemon listens on this port for HTTP requests coming directly from the browser.
Cassandra inter-node ports
Port number  Description
7000         Cassandra inter-node cluster communication.
7001         Cassandra SSL inter-node cluster communication.
7199         Cassandra JMX monitoring port.
Cassandra client ports
Port number  Description
9042         Cassandra client port.
9160         Cassandra client port (Thrift).
OpsCenter ports
Port number  Description
61620        OpsCenter monitoring port. The opscenterd daemon listens on this port for TCP traffic coming from the agent.
61621        OpsCenter agent port. The agents listen on this port for SSL traffic initiated by OpsCenter.
Procedure
1. Open the cassandra-env.sh file for editing and update or add these lines:
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=true"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/jmxremote.password"
If the LOCAL_JMX setting is in your file, set it to no:
LOCAL_JMX=no
2. Copy the jmxremote.password.template from <jre_install_location>/lib/management/ to /etc/cassandra/
and rename it to jmxremote.password:
cp <jre_install_location>/lib/management/jmxremote.password.template /etc/cassandra/jmxremote.password
3. Change the ownership of jmxremote.password to the user you run cassandra with and change
permission to read only:
chown cassandra:cassandra /etc/cassandra/jmxremote.password
chmod 400 /etc/cassandra/jmxremote.password
4. Edit jmxremote.password and add the user and password for JMX-compliant utilities:
monitorRole QED
controlRole R&D
cassandra cassandrapassword
Note: This cassandra user and cassandra password is just an example. Specify the user and
password for your environment.
5. Add the cassandra user with read and write permission to <jre_install_location>/lib/management/jmxremote.access:
monitorRole readonly
cassandra readwrite
controlRole readwrite \
create javax.management.monitor.,javax.management.timer. \
unregister
6. Restart Cassandra.
7. Run nodetool with the cassandra user and password.
Results
If you run nodetool without user and password, you see an error similar to:
[root@VM1 cassandra]# nodetool status
Database internals
Topics about the Cassandra database.
Storage engine
A description about Cassandra's storage structure and engine.
Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational
database that uses a B-Tree. Cassandra avoids reading before writing. Read-before-write, especially in
a large distributed system, can produce stalls in read performance and other problems. For example, two
clients read at the same time, one overwrites the row to make update A, and then the other overwrites the
row to make update B, removing update A. Reading before writing also corrupts caches and increases
IO requirements. To avoid a read-before-write condition, the storage engine groups inserts/updates to be
made, and sequentially writes only the updated parts of a row in append mode. Cassandra never re-writes
or re-reads existing data, and never overwrites the rows in place.
A log-structured engine that avoids overwrites and uses sequential IO to update data is essential for writing
to hard disks (HDD) and solid-state disks (SSD). On HDD, writing randomly involves a higher number of
seek operations than sequential writing. The seek penalty incurred can be substantial. Using sequential
IO, and thereby avoiding write amplification and disk failure, Cassandra accommodates inexpensive,
consumer SSDs extremely well.
Memtables and SSTables are maintained per table. SSTables are immutable, not written to again after the
memtable is flushed. Consequently, a partition is typically stored across multiple SSTable files.
For each SSTable, Cassandra creates these structures:
• Partition index (on disk): a list of partition keys and the start position of rows in the data file.
• Partition summary (in memory): a sample of the partition index.
• Bloom filter
Compaction
Periodic compaction is essential to a healthy Cassandra database because Cassandra does not insert/
update in place. As inserts/updates occur, instead of overwriting the rows, Cassandra writes a new
timestamped version of the inserted or updated data in another SSTable. Cassandra manages the
accumulation of SSTables on disk using compaction.
Cassandra also does not delete in place because the SSTable is immutable. Instead, Cassandra marks
data to be deleted using a tombstone. Tombstones exist for a configured time period defined by the
gc_grace_seconds value set on the table.
During compaction, there is a temporary spike in disk space usage and disk I/O because the old and new
SSTables co-exist. This diagram depicts the compaction process:
Compaction merges the data in each SSTable by partition key, selecting the latest data for storage
based on its timestamp. Cassandra can merge the data performantly, without random IO, because rows
are sorted by partition key within each SSTable. After evicting tombstones and removing deleted data,
columns, and rows, the compaction process consolidates SSTables into a single file. The old SSTable
files are deleted as soon as any pending reads finish using the files. Disk space occupied by old SSTables
becomes available for reuse.
Cassandra 2.1 improves read performance after compaction by performing an incremental replacement
of compacted SSTables. Instead of waiting for the entire compaction to finish and then throwing away the
old SSTable (and cache), Cassandra can read data directly from the new SSTable even before it finishes
writing.
As data is written to the new SSTable and reads are directed to it, the corresponding data in the old
SSTables is no longer accessed and is evicted from the page cache. Thus begins an incremental process
of caching the new SSTable, while directing reads away from the old one. The dramatic cache miss is
gone. Cassandra provides predictable high performance even under heavy load.
Starting compaction
You can configure these types of compaction to run periodically:
• SizeTieredCompactionStrategy
For write-intensive workloads
• LeveledCompactionStrategy
For read-intensive workloads
• DateTieredCompactionStrategy
For time series data and expiring (TTL) data
You can manually start compaction using the nodetool compact command.
For more information about compaction strategies, see When to Use Leveled Compaction and Leveled
Compaction in Apache Cassandra.
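For example, to switch an existing table to leveled compaction (the table name is illustrative):
ALTER TABLE users WITH compaction = { 'class' : 'LeveledCompactionStrategy' };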
The location of the cassandra.yaml file depends on the type of installation:
• Package installations: /etc/cassandra/cassandra.yaml
• Tarball installations: install_location/conf/cassandra.yaml
As shown in the music service example, to filter the data based on the artist, create an index on artist.
Cassandra uses the index to pull out the records in question. An attempt to filter the data before creating
the index will fail because the operation would be very inefficient. A sequential scan across the entire
playlists dataset would be required. After creating the artist index, Cassandra can filter the data in the
playlists table by artist, such as Fu Manchu.
The partition is the unit of replication in Cassandra. In the music service example, partitions are distributed
by hashing the playlist id and using the ring to locate the nodes that store the distributed data. Cassandra
would generally store playlist information on different nodes, and to find all the songs by Fu Manchu,
Cassandra would have to visit different nodes. To avoid these problems, each node indexes its own data.
This technique, however, does not guarantee trouble-free indexing, so know when and when not to use an
index.
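For example, using the music service schema referenced above:
CREATE INDEX ON playlists (artist);
SELECT * FROM playlists WHERE artist = 'Fu Manchu';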
About deletes
How Cassandra deletes data and why deleted data can reappear.
The way Cassandra deletes data differs from the way a relational database deletes data. A relational
database might spend time scanning through data looking for expired data and throwing it away or an
administrator might have to partition expired data by month, for example, to clear it out faster. Data in a
Cassandra column can have an optional expiration date called TTL (time to live). Use CQL to set the TTL
in seconds for data. Cassandra marks TTL data with a tombstone after the requested amount of time has
expired. A tombstone exists for gc_grace_seconds. After data is marked with a tombstone, the data is
automatically removed during the normal compaction process.
Facts about deleted data to keep in mind are:
• Cassandra does not immediately remove data marked for deletion from disk. The deletion occurs during
compaction.
• If you use the size-tiered or date-tiered compaction strategy, you can drop data immediately by
manually starting the compaction process. Before doing so, understand the documented disadvantages
of the process.
• Deleted data can reappear if you do not run node repair routinely.
Why deleted data can reappear
Marking data with a tombstone signals Cassandra to retry sending a delete request to a replica that was
down at the time of delete. If the replica comes back up within the grace period of time, it eventually
receives the delete request. However, if a node is down longer than the grace period, the node can miss
the delete because the tombstone disappears after gc_grace_seconds. Cassandra always attempts to
replay missed updates when the node comes back up again. After a failure, it is a best practice to run node
repair to repair inconsistencies across all of the replicas when bringing a node back into the cluster. If the
node doesn't come back within gc_grace_seconds, remove the node, wipe it, and bootstrap it again.
operations, except when a client application uses a consistency level of ANY. You enable or disable hinted
handoff in the cassandra.yaml file.
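For example, the relevant cassandra.yaml settings (the values shown are the usual 2.1 defaults) are:
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours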
Cassandra, configured with a consistency level of ONE, calls the write good because Cassandra can read
the data on node B. When node C comes back up, node A reacts to the hint by forwarding the data to node
C. For more information about how hinted handoff works, see "Modern hinted handoff" by Jonathan Ellis.
Performance
By design, hinted handoff inherently forces Cassandra to continue performing the same number of writes
even when the cluster is operating at reduced capacity. Pushing your cluster to maximum capacity with no
allowance for failures is a bad idea.
Hinted handoff is designed to minimize the extra load on the cluster.
All hints for a given replica are stored under a single partition key, so replaying hints is a simple sequential
read with minimal performance impact.
If a replica node is overloaded or unavailable, and the failure detector has not yet marked it down, then
expect most or all writes to that node to fail after the timeout triggered by write_request_timeout_in_ms
(2000 milliseconds by default in Cassandra 2.1).
If this happens on many nodes at once, the hints can put substantial memory pressure on the coordinator.
The coordinator therefore tracks how many hints it is currently writing, and if this number gets too high, it
temporarily refuses writes (with OverloadedException) whose replicas include the misbehaving nodes.
Removal of hints
When removing a node from the cluster by decommissioning the node or by using the nodetool
removenode command, Cassandra automatically removes hints targeting the node that no longer exists.
Cassandra also removes hints for dropped tables.
Reads
How reads work and factors affecting them.
About reads
How Cassandra combines results from the active memtable and potentially multiple SSTables to satisfy a
read.
To satisfy a read, Cassandra must combine results from the active memtable and potentially multiple
SSTables.
First, Cassandra checks the Bloom filter. Each SSTable has a Bloom filter associated with it that checks
the probability of having any data for the requested partition in the SSTable before doing any disk I/O.
If the Bloom filter does not rule out the SSTable, Cassandra checks the partition key cache and takes one
of these courses of action:
• If an index entry is found in the cache:
• Cassandra goes to the compression offset map to find the compressed block having the data.
• Fetches the compressed data on disk and returns the result set.
• If an index entry is not found in the cache:
• Cassandra searches the partition summary to determine the approximate location on disk of the
index entry.
• Next, to fetch the index entry, Cassandra hits the disk for the first time, performing a single seek and
a sequential read of columns (a range read) in the SSTable if the columns are contiguous.
• Cassandra goes to the compression offset map to find the compressed block having the data.
• Fetches the compressed data on disk and returns the result set.
The red lines in the SSTables in this diagram are fragments of a row that Cassandra needs to combine to
give the user the requested results. Cassandra caches the merged value, not the raw row fragments. That
saves some CPU and disk I/O.
The row cache is not write-through. If a write comes in for the row, the cache for it is invalidated and is not
cached again until it is read.
For rows that are accessed frequently, Cassandra 2.1 provides an improved built-in partition key cache and
an optional row cache.
As a non-relational database, Cassandra does not support joins or foreign keys, and consequently does
not offer consistency in the ACID sense. For example, when moving money from account A to B the total
in the accounts does not change. Cassandra supports atomicity and isolation at the row-level, but trades
transactional isolation and atomicity for high availability and fast write performance. Cassandra writes are
durable.
Atomicity
Everything in a transaction succeeds or the entire transaction is rolled back.
In Cassandra, a write is atomic at the partition-level, meaning inserting or updating columns in a row is
treated as one write operation. A delete operation is also performed atomically. By default, all operations
in a batch are performed atomically. Cassandra uses a batch log to ensure all operations in a batch
are applied atomically. There is a performance penalty for batch atomicity when a batch spans multiple
partitions. If you do not want to incur this penalty, use the UNLOGGED option. Using UNLOGGED makes
the batch operation atomic only within a single partition.
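For example, a sketch of an unlogged batch confined to a single partition (the table and columns are hypothetical):
BEGIN UNLOGGED BATCH
  INSERT INTO sensor_readings (sensor_id, reading_time, value) VALUES ('s-1', '2015-07-10 12:00:00', 20.1);
  INSERT INTO sensor_readings (sensor_id, reading_time, value) VALUES ('s-1', '2015-07-10 12:01:00', 20.4);
APPLY BATCH;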
For example, if using a write consistency level of QUORUM with a replication factor of 3, Cassandra sends
the write to all replica nodes that own the row and waits for acknowledgement from two of them. If the write fails
on one of the nodes but succeeds on the other, Cassandra reports a failure to replicate the write on that
node. However, the replicated write that succeeds on the other node is not automatically rolled back.
Cassandra uses timestamps to determine the most recent update to a column. Depending on the version
of the Native CQL Protocol, the timestamp is provided by either the client application or the server. The
latest timestamp always wins when requesting data, so if multiple client sessions update the same columns
in a row concurrently, the most recent update is the one seen by readers.
Important: Native CQL Protocol V3 supports client-side timestamps. Be sure to check your client's
documentation to ensure that it generates client-side timestamps and that this feature is activated.
Consistency
A transaction cannot leave the database in an inconsistent state. Cassandra offers different types of
consistency.
Cassandra offers two consistency types:
• Tunable consistency
Availability and consistency can be tuned, and can be strong in the CAP sense--data is made
consistent across all the nodes in a distributed database cluster.
• Linearizable consistency
In ACID terms, linearizable consistency is a serial (immediate) isolation level for lightweight
transactions.
In Cassandra, there are no locking or transactional dependencies when concurrently updating multiple
rows or tables. Tuning availability and consistency always gives you partition tolerance. A user can pick
and choose on a per operation basis how many nodes must receive a DML command or respond to a
SELECT query.
For in-depth information about this new consistency level, see the article, Lightweight transactions in
Cassandra.
To support linearizable consistency, a consistency level of SERIAL has been added to Cassandra.
Additions to CQL have been made to support lightweight transactions.
Isolation
Transactions cannot interfere with each other.
In early versions of Cassandra, it was possible to see partial updates in a row when one user was updating
the row while another user was reading that same row. For example, if one user was writing a row with two
thousand columns, another user could potentially read that same row and see some of the columns, but
not all of them if the write was still in progress.
Full row-level isolation is in place, which means that writes to a row are isolated to the client performing
the write and are not visible to any other user until they are complete. Delete operations are performed in
isolation. All updates in a batch operation belonging to a given partition key are performed in isolation.
Durability
Completed transactions persist in the event of crashes or server failure.
Writes in Cassandra are durable. All writes to a replica node are recorded both in memory and in a commit
log on disk before they are acknowledged as a success. If a crash or server failure occurs before the
memtables are flushed to disk, the commit log is replayed on restart to recover any lost writes. In addition
to the local durability (data immediately written to disk), the replication of data on other nodes strengthens
durability.
You can manage the local durability to suit your needs for consistency using the commitlog_sync option in
the cassandra.yaml file. Set the option to either periodic or batch.
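For example, the usual 2.1 defaults in cassandra.yaml are:
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000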
The location of the cassandra.yaml file depends on the type of installation:
• Package installations: /etc/cassandra/cassandra.yaml
• Tarball installations: install_location/conf/cassandra.yaml
Lightweight transactions
A description about lightweight transactions and when to use them.
Lightweight transactions with linearizable consistency ensure transaction isolation level similar to the
serializable level offered by RDBMS’s. They are also known as compare and set transactions. You use
lightweight transactions instead of durable transactions with eventual/tunable consistency for situations that
require nodes in the distributed system to agree on changes to data. For example, two users attempting
to create a unique user account in the same cluster could overwrite each other’s work. Using a lightweight
transaction, the nodes can agree to create only one account.
Cassandra implements lightweight transactions by extending the Paxos consensus protocol, which is
based on a quorum-based algorithm. Using this protocol, a distributed system can agree on proposed data
additions/modifications without the need for a master database or two-phase commit.
You use extensions in CQL for lightweight transactions.
You can use an IF clause in a number of CQL statements, such as INSERT, for lightweight transactions.
For example, to ensure that an insert into a new accounts table is unique for a new customer, use the IF
NOT EXISTS clause:
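For example (matching the customer_account columns used below):
INSERT INTO customer_account (customerID, customer_email)
VALUES ('LauraS', '[email protected]')
IF NOT EXISTS;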
DML modifications you make using UPDATE can also make use of the IF clause by comparing one or
more columns to various values:
UPDATE customer_account
SET customer_email = '[email protected]'
IF customerID = 'LauraS';
Cassandra 2.1.1 and later support non-equal conditions for lightweight transactions. You can use the <, <=,
>, >=, !=, and IN operators in the IF clause of a lightweight transaction. Behind the scenes, Cassandra
makes four round trips between a node proposing a lightweight transaction and any needed replicas
in the cluster to ensure proper execution, so performance is affected. Consequently, reserve lightweight
transactions for those situations where they are absolutely necessary; Cassandra’s normal eventual
consistency can be used for everything else.
A SERIAL consistency level allows reading the current (and possibly uncommitted) state of data without
proposing a new addition or update. If a SERIAL read finds an uncommitted transaction in progress,
Cassandra performs a read repair as part of the commit.
Data consistency
Topics about how up-to-date and synchronized a row of data is on all replicas.
quorum = (sum_of_replication_factors / 2) + 1
A quorum is rounded down to a whole number. The sum of all the replication_factor settings for each data
center is the sum_of_replication_factors.
Examples:
• Using a replication factor of 3, a quorum is 2 nodes. The cluster can tolerate 1 replica down.
• Using a replication factor of 6, a quorum is 4. The cluster can tolerate 2 replicas down.
• In a two data center cluster where each data center has a replication factor of 3, a quorum is 4 nodes.
The cluster can tolerate 2 replica nodes down.
• In a five data center cluster where two data centers have a replication factor of 3 and three data centers
have a replication factor of 2, a quorum is 6 nodes.
The more data centers, the higher number of replica nodes need to respond for a successful operation.
If consistency is a top priority, you can ensure that a read always reflects the most recent write by using the
following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM consistency level for both write and read operations
and you are using a replication factor of 3, then this ensures that 2 nodes are always written and 2 nodes
are always read. The combination of nodes written and read (4) being greater than the replication factor (3)
ensures strong read consistency.
Similar to QUORUM, the LOCAL_QUORUM level is calculated based on the replication factor of the same data
center as the coordinator node. That is, even if the cluster has more than one data center, the quorum is
calculated only with local replica nodes.
In EACH_QUORUM, every data center in the cluster must reach a quorum based on that data center's
replication factor in order for the read or write request to succeed. That is, for every data center in the
cluster a quorum of replica nodes must respond to the coordinator node in order for the read or write
request to succeed.
Read requests
The three types of read requests that a coordinator node can send to a replica.
There are three types of read requests that a coordinator can send to a replica:
• A direct read request
• A digest request
• A background read repair request
The coordinator node contacts one replica node with a direct read request. Then the coordinator sends
a digest request to a number of replicas determined by the consistency level specified by the client. The
digest request checks the data in the replica node to make sure it is up to date. Then the coordinator
sends a digest request to all remaining replicas. If any replica nodes have out of date data, a background
read repair request is sent. Read repair requests ensure that the requested row is made consistent on all
replicas.
For a digest request the coordinator first contacts the replicas specified by the consistency level. The
coordinator sends these requests to the replicas that are currently responding the fastest. The nodes
contacted respond with a digest of the requested data; if multiple nodes are contacted, the rows from each
replica are compared in memory to see if they are consistent. If they are not, then the replica that has the
most recent data (based on the timestamp) is used by the coordinator to forward the result back to the
client.
To ensure that all replicas have the most recent version of frequently-read data, the coordinator also
contacts and compares the data from all the remaining replicas that own the row in the background. If the
replicas are inconsistent, the coordinator issues writes to the out-of-date replicas to update the row to the
most recent values. This process is known as read repair. Read repair can be configured per table for
non-QUORUM consistency levels (using read_repair_chance), and is enabled by default.
For illustrated examples of read requests, see the examples of read consistency levels.
(Diagrams: read request examples in a single data center cluster with replicas R1, R2, and R3; in the first, the coordinator node resends the request to another replica after a timeout; legend: client, coordinator node, chosen node, read response, read repair.)
Single data center cluster with 3 replica nodes and consistency set to ONE
(Diagram: read request flow among replicas R1, R2, and R3; legend: client, coordinator node, chosen node, read response, read repair.)
Multiple data center cluster with 3 replica nodes and consistency set to ONE
(Diagram: read request flow across two data centers, including Data Center Beta; legend: coordinator node, chosen node, read response, read repair.)
Multiple data center cluster with 3 replica nodes and consistency set to LOCAL_ONE
(Diagram: read request flow across two data centers, including Data Center Beta, with the read confined to the local data center; legend: coordinator node, chosen node, read response, read repair.)
Write requests
How write requests work.
The coordinator sends a write request to all replicas that own the row being written. As long as all replica
nodes are up and available, they will get the write regardless of the consistency level specified by the
client. The write consistency level determines how many replica nodes must respond with a success
acknowledgment in order for the write to be considered successful. Success means that the data was
written to the commit log and the memtable as described in About writes.
For example, in a single data center 10 node cluster with a replication factor of 3, an incoming write will
go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE,
the first node to complete the write responds back to the coordinator, which then proxies the success
message back to the client. A consistency level of ONE means that it is possible that 2 of the 3 replicas
could miss the write if they happened to be down at the time the request was made. If a replica misses
a write, Cassandra will make the row consistent later using one of its built-in repair mechanisms: hinted
handoff, read repair, or anti-entropy node repair.
In multiple data center deployments, the coordinator forwards the write to one node in each remote data
center; that node forwards the write to the other replicas of that row in its data center. The coordinator
responds back to the client once it receives a write acknowledgment from the number of nodes specified
by the consistency level.
Single data center cluster with 3 replica nodes and consistency set to ONE
(Diagram: write request flow among replicas R1, R2, and R3; legend: client, coordinator node, chosen node, write response.)
(Diagram: write request flow across two data centers, including Data Center Beta; legend: coordinator node, nodes that make up a quorum, write response.)
Configuration
Configuration topics.
• Generally set to empty. If the node is properly configured (host name, name resolution, and so on),
Cassandra uses InetAddress.getLocalHost() to get the local address from the system.
• For a single node cluster, you can use the default setting (localhost).
• If Cassandra can't find the correct address, you must specify the IP address or host name.
• Never specify 0.0.0.0; it is always wrong.
listen_interface
(Default: eth0) The interface that Cassandra binds to for connecting to other Cassandra nodes. Interfaces
must correspond to a single address; IP aliasing is not supported. See listen_address.
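For example, set one of the two options in cassandra.yaml, not both (the address shown is illustrative):
listen_address: 192.0.2.10
# listen_interface: eth0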
Default directories
If you have changed any of the default directories during installation, make sure you have root access and
set these properties:
commitlog_directory
The directory where the commit log is stored. Default locations:
• Package installations: /var/lib/cassandra/commitlog
• Tarball installations: install_location/data/commitlog
For optimal write performance, place the commit log on a separate disk partition, or (ideally) a separate
physical device from the data file directories. Because the commit log is append only, an HDD is
acceptable for this purpose.
data_file_directories
The directory location where table data (SSTables) is stored. Cassandra distributes data evenly across the
location, subject to the granularity of the configured compaction strategy. Default locations:
• Package installations: /var/lib/cassandra/data
• Tarball installations: install_location/data/data
As a production best practice, use RAID 0 and SSDs.
saved_caches_directory
The directory location where table key and row caches are stored. Default location:
• Package installations: /var/lib/cassandra/saved_caches
• Tarball installations: install_location/data/saved_caches
• ignore
Ignore fatal errors and let the batches fail.
disk_failure_policy
(Default: stop) Sets how Cassandra responds to disk failure. Recommended settings are stop or best_effort.
• die
Shut down gossip and Thrift and kill the JVM for any file system errors or single SSTable errors, so the
node can be replaced.
• stop_paranoid
Shut down gossip and Thrift even for single SSTable errors.
• stop
Shut down gossip and Thrift, leaving the node effectively dead, but available for inspection using JMX.
• best_effort
Stop using the failed disk and respond to requests based on the remaining available SSTables. This
means you will see obsolete data at consistency level of ONE.
• ignore
Ignores fatal errors and lets the requests fail; all file system errors are logged but otherwise ignored.
Cassandra acts as in versions prior to 1.2.
Related information: Handling Disk Failures In Cassandra 1.2 blog and Recovering using JBOD.
endpoint_snitch
(Default: org.apache.cassandra.locator.SimpleSnitch) Set to a class that implements the
IEndpointSnitch. Cassandra uses snitches for locating nodes and routing requests.
• SimpleSnitch
Use for single-data center deployments or single-zone in public clouds. Does not recognize data center
or rack information. It treats strategy order as proximity, which can improve cache locality when disabling
read repair.
• GossipingPropertyFileSnitch
Recommended for production. The rack and data center for the local node are defined in the
cassandra-rackdc.properties file and propagated to other nodes via gossip. To allow migration
from the PropertyFileSnitch, it uses the cassandra-topology.properties file if it is present.
• PropertyFileSnitch
Determines proximity by rack and data center, which are explicitly configured in the cassandra-
topology.properties file.
• Ec2Snitch
For EC2 deployments in a single region. Loads region and availability zone information from the EC2
API. The region is treated as the data center and the availability zone as the rack. Uses only private IPs;
consequently, it does not work across multiple regions.
• Ec2MultiRegionSnitch
Uses public IPs as the broadcast_address to allow cross-region connectivity. This means you must also
set seed addresses to the public IP and open the storage_port or ssl_storage_port on the public IP
firewall. For intra-region traffic, Cassandra switches to the private IP after establishing a connection.
• RackInferringSnitch:
Proximity is determined by rack and data center, which are assumed to correspond to the 3rd and 2nd
octet of each node's IP address, respectively. This snitch is best used as an example for writing a custom
snitch class (unless this happens to match your deployment conventions).
Related information: Snitches
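For example, a production node using the recommended GossipingPropertyFileSnitch might be configured
as in the following sketch; the data center and rack names are placeholders:
# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch
# conf/cassandra-rackdc.properties on each node
dc=DC1
rack=RAC1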
rpc_address
(Default: localhost) The listen address for client connections (Thrift RPC service and native transport). Valid
values are:
• unset:
Resolves the address using the hostname configuration of the node. If left unset, the hostname must
resolve to the IP address of this node using /etc/hostname, /etc/hosts, or DNS.
• 0.0.0.0:
Listens on all configured interfaces, but you must set the broadcast_rpc_address to a value other than
0.0.0.0.
• IP address
• hostname
Related information: Network
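For example, to listen for clients on all interfaces while advertising a single routable address to drivers, a
sketch of the relevant cassandra.yaml lines might be (the address is a placeholder):
rpc_address: 0.0.0.0
broadcast_rpc_address: 203.0.113.10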
rpc_interface
(Default: eth1) The listen address for client connections. Interfaces must correspond to a single address,
IP aliasing is not supported. See rpc_address.
seed_provider
The addresses of hosts deemed contact points. Cassandra nodes use the -seeds list to find each other and
learn the topology of the ring.
• class_name (Default: org.apache.cassandra.locator.SimpleSeedProvider)
The class within Cassandra that handles the seed logic. It can be customized, but this is typically not
required.
• - seeds (Default: 127.0.0.1)
A comma-delimited list of IP addresses used by gossip for bootstrapping new nodes joining a cluster.
When running multiple nodes, you must change the list from the default value. In multiple data-center
clusters, the seed list should include at least one node from each data center (replication group). More
than a single seed node per data center is recommended for fault tolerance. Otherwise, gossip has to
communicate with another data center when bootstrapping a node. Making every node a seed node is not
recommended because of increased maintenance and reduced gossip performance. Gossip optimization
is not critical, but it is recommended to use a small seed list (approximately three nodes per data center).
Related information: Initializing a multiple node cluster (single data center) and Initializing a multiple node
cluster (multiple data centers).
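For example, a sketch of the seed_provider block for a two data-center cluster; the addresses are
placeholders and should include at least one node from each data center:
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
           - seeds: "10.0.0.1,10.0.0.2,10.1.0.1"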
Common compaction settings
compaction_throughput_mb_per_sec
(Default: 16) Throttles compaction to the specified total throughput across the entire system. The faster you
insert data, the faster you need to compact in order to keep the SSTable count down. The recommended
value is 16 to 32 times the rate of write throughput (in MB/second). Setting the value to 0 disables compaction
throttling.
Related information: Configuring compaction
Common memtable settings
memtable_total_space_in_mb
(Default: 1/4 of heap) Specifies the total memory used for all memtables on a node. This replaces the
per-table storage settings memtable_operations_in_millions and memtable_throughput_in_mb.
Related information: Tuning the Java heap
Common disk settings
concurrent_reads
(Default: 32) For workloads with more data than can fit in memory, the bottleneck is reads fetching data
from disk. Setting to (16 × number_of_drives) allows operations to queue low enough in the stack so that
the OS and drives can reorder them. The default setting applies to both logical volume managed (LVM)
and RAID drives.
concurrent_writes
(Default: 32) Writes in Cassandra are rarely I/O bound, so the ideal number of concurrent writes depends
on the number of CPU cores in your system. The recommended value is 8 × number_of_cpu_cores.
concurrent_counter_writes
(Default: 32) Counter writes read the current values before incrementing and writing them back. The
recommended value is (16 × number_of_drives) .
Common automatic backup settings
incremental_backups
(Default: false) Backs up data updated since the last snapshot was taken. When enabled, Cassandra creates
a hard link to each SSTable flushed or streamed locally in a backups/ subdirectory of the keyspace data.
Removing these links is the operator's responsibility.
Related information: Enabling incremental backups
snapshot_before_compaction
(Default: false) Enable or disable taking a snapshot before each compaction. This option is useful to back
up data when there is a data format change. Be careful using this option because Cassandra does not clean
up older snapshots automatically.
Related information: Configuring compaction
Common fault detection setting
phi_convict_threshold
(Default: 8) Adjusts the sensitivity of the failure detector on an exponential scale. Generally this setting
never needs adjusting.
Related information: Failure detection and recovery
potentially include commitlog segments from every table in the system. The default size is usually suitable
for most commitlog archiving, but if you want a finer granularity, 8 or 16 MB is reasonable.
Related information: Commit log archive configuration
commitlog_total_space_in_mb
(Default: 32MB for 32-bit JVMs, 8192MB for 64-bit JVMs) Total space used for commitlogs. If the used
space goes above this value, Cassandra rounds up to the next nearest segment multiple and flushes
memtables to disk for the oldest commitlog segments, removing those log segments. This reduces the
amount of data to replay on start-up, and prevents infrequently-updated tables from indefinitely keeping
commitlog segments. A small total commitlog space tends to cause more flush activity on less-active tables.
Related information: Configuring memtable throughput
Compaction settings
Related information: Configuring compaction
compaction_preheat_key_cache
(Default: true) When set to true, cached row keys are tracked during compaction, and re-cached to their
new positions in the compacted SSTable. If you have extremely large key caches for tables, set the value
to false; see Global row and key caches properties.
concurrent_compactors
(Default: Smaller of number of disks or number of cores, with a minimum of 2 and a maximum of 8 per
CPU core) Sets the number of concurrent compaction processes allowed to run simultaneously on a node,
not including validation compactions for anti-entropy repair. Simultaneous compactions help preserve read
performance in a mixed read-write workload by mitigating the tendency of small SSTables to accumulate
during a single long-running compaction. If your data directories are backed by SSD, increase this value
to the number of cores. If compaction is running too slowly or too quickly, adjust
compaction_throughput_mb_per_sec first.
in_memory_compaction_limit_in_mb
(Default: 64MB) Size limit for rows being compacted in memory. Larger rows spill to disk and use a
slower two-pass compaction process. When this occurs, a message is logged specifying the row key. The
recommended value is 5 to 10 percent of the available Java heap size.
preheat_kernel_page_cache
(Default: false) Enable or disable kernel page cache preheating from contents of the key cache after
compaction. When enabled, it preheats only the first page (4KB) of each row to optimize for sequential
access. It can be harmful for fat rows; see CASSANDRA-4937 for more details.
sstable_preemptive_open_interval_in_mb
(Default: 50MB) When compacting, the replacement SSTables are opened before they are completely
written and used in place of the prior SSTables for any range that has already been written. This setting
helps to smoothly transfer reads between the SSTables by reducing page cache churn and keeps hot rows hot.
Memtable settings
memtable_allocation_type
(Default: heap_buffers) Specify the way Cassandra allocates and manages memtable memory. See Off-
heap memtables in Cassandra 2.1. Options are:
• heap_buffers
On heap NIO (non-blocking I/O) buffers.
• offheap_buffers
Off heap (direct) NIO buffers.
• offheap_objects
Native memory, eliminating NIO buffer heap overhead.
memtable_cleanup_threshold
(Default: 0.11, computed as 1/(memtable_flush_writers + 1)) Ratio of occupied non-flushing memtable size to total
permitted size for triggering a flush of the largest memtable. Larger values mean larger flushes and less
compaction, but also less concurrent flush activity, which can make it difficult to keep your disks saturated
under heavy write load.
file_cache_size_in_mb
(Default: Smaller of 1/4 heap or 512) Total memory to use for SSTable-reading buffers.
memtable_flush_queue_size
(Default: 4) The number of full memtables to allow pending flush (memtables waiting for a write thread). At
a minimum, set to the maximum number of indexes created on a single table.
Related information: Flushing data from the memtable
memtable_flush_writers
(Default: Smaller of number of disks or number of cores with a minimum of 2 and a maximum of 8) Sets
the number of memtable flush writer threads. These threads are blocked by disk I/O, and each one holds
a memtable in memory while blocked. If your data directories are backed by SSD, increase this setting to
the number of cores.
memtable_heap_space_in_mb
(Default: 1/4 heap) Total permitted memory to use for memtables. Triggers a flush based on
memtable_cleanup_threshold. Cassandra stops accepting writes when the limit is exceeded until a flush
completes. If unset, sets to default.
memtable_offheap_space_in_mb
(Default: 1/4 heap) See memtable_heap_space_in_mb.
Cache and index settings
column_index_size_in_kb
(Default: 64) Granularity of the index of rows within a partition. For huge rows, decrease this setting to
improve seek time. If you use key cache, be careful not to make this setting too large because key cache
will be overwhelmed. If you're unsure of the size of the rows, it's best to use the default setting.
index_summary_capacity_in_mb
(Default: empty, which means 5% of the heap size) Fixed memory pool size in MB for SSTable index summaries. If
the memory usage of all index summaries exceeds this limit, any SSTables with low read rates shrink their
index summaries to meet this limit. This is a best-effort process. In extreme conditions, Cassandra may
need to use more than this amount of memory.
index_summary_resize_interval_in_minutes
(Default: 60 minutes) How frequently index summaries should be re-sampled. This is done periodically to
redistribute memory from the fixed-size pool to SSTables in proportion to their recent read rates. To disable,
set to -1. This leaves existing index summaries at their current sampling level.
reduce_cache_capacity_to
(Default: 0.6) Sets the size percentage to which maximum cache capacity is reduced when Java heap usage
reaches the threshold defined by reduce_cache_sizes_at.
reduce_cache_sizes_at
(Default: 0.85) When Java heap usage (after a full concurrent mark sweep (CMS) garbage collection)
exceeds this percentage, Cassandra reduces the cache capacity to the fraction of the current size as
specified by reduce_cache_capacity_to. To disable, set the value to 1.0.
Disks settings
stream_throughput_outbound_megabits_per_sec
(Default: 200 Mbps) Throttles all outbound streaming file transfers on a node to the specified
throughput. Cassandra does mostly sequential I/O when streaming data during bootstrap or repair, which
can lead to saturating the network connection and degrading client (RPC) performance.
inter_dc_stream_throughput_outbound_megabits_per_sec
(Default: unset) Throttles all streaming file transfers between data centers. This setting throttles streaming
throughput between data centers in addition to the global throttle configured with
stream_throughput_outbound_megabits_per_sec.
trickle_fsync
(Default: false) When doing sequential writing, enabling this option forces the operating system to flush
dirty buffers every trickle_fsync_interval_in_kb of data written. Enable this parameter to prevent sudden
dirty-buffer flushing from impacting read latencies. Recommended for use on SSDs, but not on HDDs.
trickle_fsync_interval_in_kb
(Default: 10240). Sets the size of the fsync in kilobytes.
Advanced properties
Properties for advanced users or properties that are less commonly used.
Advanced initialization properties
auto_bootstrap
(Default: true) This setting has been removed from default configuration. It makes new (non-seed) nodes
automatically migrate the right data to themselves. When initializing a fresh cluster without data, add
auto_bootstrap: false.
Related information: Initializing a multiple node cluster (single data center) and Initializing a multiple node
cluster (multiple data centers).
batch_size_warn_threshold_in_kb
(Default: 5KB per batch) Log WARN on any batch size exceeding this value in kilobytes. Caution should be
taken on increasing the size of this threshold as it can lead to node instability.
broadcast_address
(Default: listen_address) The IP address a node tells other nodes in the cluster to contact it by. It allows
public and private address to be different. For example, use the broadcast_address parameter in topologies
where not all nodes have access to other nodes by their private IP addresses.
If your Cassandra cluster is deployed across multiple Amazon EC2 regions and you use the
Ec2MultiRegionSnitch, set the broadcast_address to public IP address of the node and the
listen_address to the private IP. See Ec2MultiRegionSnitch.
initial_token
(Default: disabled) Used in the single-node-per-token architecture, where a node owns exactly one
contiguous range in the ring space. Setting this property overrides num_tokens.
If you are not using vnodes, or if num_tokens is set to 1 or left unspecified (#num_tokens), you should
always specify this parameter when setting up a production cluster for the first time and when adding
capacity. For more information, see this parameter in the Cassandra 1.1 Node and Cluster Configuration
documentation. This parameter can be used with num_tokens (vnodes) in special cases such as restoring
from a snapshot.
num_tokens
(Default: 256) Defines the number of tokens randomly assigned to this node on the ring when using
virtual nodes (vnodes). The more tokens, relative to other nodes, the larger the proportion of data that
the node stores. Generally all nodes should have the same number of tokens assuming equal hardware
capability. The recommended value is 256. If unspecified (#num_tokens), Cassandra uses 1 (equivalent
to #num_tokens : 1) for legacy compatibility and uses the initial_token setting.
If not using vnodes, comment #num_tokens : 256 or set num_tokens : 1 and use initial_token. If
you already have an existing cluster with one token per node and wish to migrate to vnodes, see Enabling
virtual nodes on an existing production cluster.
Note: If using DataStax Enterprise, the default setting of this property depends on the type of node
and type of install.
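For example, the relevant cassandra.yaml lines for a vnode node versus a legacy single-token node might
look like the following sketch; the token value is only an example:
# Virtual nodes (new clusters):
num_tokens: 256
# initial_token:

# Single-token (legacy) node:
# num_tokens: 1
initial_token: -9223372036854775808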
partitioner
(Default: org.apache.cassandra.dht.Murmur3Partitioner) Distributes rows (by partition key)
across all nodes in the cluster. Any IPartitioner may be used, including your own as long as it is in the
class path. For new clusters use the default partitioner.
Cassandra provides the following partitioners for backwards compatibility:
• RandomPartitioner
• ByteOrderedPartitioner
• OrderPreservingPartitioner (deprecated)
Related information: Partitioners
storage_port
(Default: 7000) The port for inter-node communication.
Advanced automatic backup setting
auto_snapshot
(Default: true) Enable or disable taking a snapshot before keyspace truncation or dropping of tables. To
prevent data loss, using the default setting is strongly advised. If you set this to false, you will lose data on
truncation or drop.
Key caches and global row properties
When creating or modifying tables, you enable or disable the key cache (partition key cache) or row cache
for that table by setting the caching parameter. Other row and key cache tuning and configuration options
are set at the global (node) level. Cassandra uses these settings to automatically distribute memory for
each table on the node based on the overall workload and specific table usage. You can also configure the
save periods for these caches globally.
Related information: Configuring caches
key_cache_keys_to_save
(Default: disabled - all keys are saved) Number of keys from the key cache to save.
key_cache_save_period
(Default: 14400 seconds [4 hours]) Duration in seconds that keys are saved in cache. Caches are saved
to saved_caches_directory. Saved caches greatly improve cold-start speeds and have relatively little effect
on I/O.
key_cache_size_in_mb
(Default: empty) A global cache setting for tables. It is the maximum size of the key cache in memory. When
no value is set, the cache is set to the smaller of 5% of the available heap, or 100MB. To disable set to 0.
Related information: setcachecapacity.
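For example, assuming the nodetool setcachecapacity syntax in this release (key cache, row cache, and
counter cache capacities, in MB), the caches could be resized at runtime without editing cassandra.yaml:
$ nodetool setcachecapacity 100 0 50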
row_cache_keys_to_save
(Default: disabled - all keys are saved) Number of keys from the row cache to save.
row_cache_size_in_mb
(Default: 0 - disabled) Maximum size of the row cache in memory. The row cache can save more time than
the key cache, but it is space-intensive because it contains the entire row. Use the row cache only for hot
rows or static rows. If you reduce the size, you may not get your hottest keys loaded on start-up.
row_cache_save_period
(Default: 0- disabled) Duration in seconds that rows are saved in cache. Caches are saved to
saved_caches_directory. This setting has limited use as described in row_cache_size_in_mb.
memory_allocator
(Default: NativeAllocator) The off-heap memory allocator. In addition to caches, this property affects storage
engine meta data. Supported values:
• NativeAllocator
• JEMallocAllocator
Experiments show that jemalloc saves some memory compared to the native allocator because it is
more fragmentation resistant. To use, install jemalloc as a library and modify cassandra-env.sh
(instructions in file).
Counter caches properties
Counter cache helps to reduce counter locks' contention for hot counter cells. In case of RF = 1 a counter
cache hit will cause Cassandra to skip the read before write entirely. With RF > 1 a counter cache hit will
still help to reduce the duration of the lock hold, helping with hot counter cell updates, but will not allow
skipping the read entirely. Only the local (clock, count) tuple of a counter cell is kept in memory, not the
whole counter, so it's relatively cheap.
Note: Reducing the size of the counter cache may result in not getting the hottest keys loaded on
start-up.
counter_cache_size_in_mb
(Default value: empty) When no value is specified, Cassandra uses the smaller of 2.5% of the heap or
50MB. If you perform counter deletes and rely on low gc_grace_seconds, you should disable the counter
cache. To disable, set to 0.
counter_cache_save_period
(Default: 7200 seconds [2 hours]) Duration after which Cassandra should save the counter cache (keys
only). Caches are saved to saved_caches_directory.
counter_cache_keys_to_save
(Default value: disabled) Number of keys from the counter cache to save. When disabled all keys are
saved.
Tombstone settings
When executing a scan, within or across a partition, tombstones must be kept in memory to allow returning
them to the coordinator. The coordinator uses them to ensure other replicas know about the deleted rows.
Workloads that generate numerous tombstones may cause performance problems and exhaust the server
heap. See Cassandra anti-patterns: Queues and queue-like datasets. Adjust these thresholds only if you
understand the impact and want to scan more tombstones. Additionally, you can adjust these thresholds at
runtime using the StorageServiceMBean.
Related information: Cassandra anti-patterns: Queues and queue-like datasets
tombstone_warn_threshold
(Default: 1000) The maximum number of tombstones a query can scan before warning.
tombstone_failure_threshold
(Default: 100000) The maximum number of tombstones a query can scan before aborting.
Network timeout settings
range_request_timeout_in_ms
(Default: 10000 milliseconds) The time that the coordinator waits for sequential or index scans to complete.
read_request_timeout_in_ms
(Default: 5000 milliseconds) The time that the coordinator waits for read operations to complete.
counter_write_request_timeout_in_ms
(Default: 5000 milliseconds) The time that the coordinator waits for counter writes to complete.
cas_contention_timeout_in_ms
(Default: 1000 milliseconds) The time that the coordinator continues to retry a CAS (compare and set)
operation that contends with other proposals for the same row.
truncate_request_timeout_in_ms
(Default: 60000 milliseconds) The time that the coordinator waits for truncates (remove all data from a
table) to complete. The long default value allows for a snapshot to be taken before removing the data. If
auto_snapshot is disabled (not recommended), you can reduce this time.
write_request_timeout_in_ms
(Default: 2000 milliseconds) The time that the coordinator waits for write operations to complete.
Related information: About hinted handoff writes
request_timeout_in_ms
(Default: 10000 milliseconds) The default time for other, miscellaneous operations.
Related information: About hinted handoff writes
Inter-node settings
cross_node_timeout
(Default: false) Enable or disable operation timeout information exchange between nodes (to accurately
measure request timeouts). If disabled, Cassandra assumes requests are forwarded to the replica instantly
by the coordinator, which means that under overload conditions extra time is spent processing requests
that have already timed out.
Caution: Before enabling this property make sure NTP (network time protocol) is installed and the
times are synchronized between the nodes.
internode_send_buff_size_in_bytes
(Default: N/A) Sets the sending socket buffer size in bytes for inter-node calls.
When setting this parameter and internode_recv_buff_size_in_bytes, the buffer size is limited by
net.core.wmem_max. When unset, buffer size is defined by net.ipv4.tcp_wmem. See man tcp and:
• /proc/sys/net/core/wmem_max
• /proc/sys/net/core/rmem_max
• /proc/sys/net/ipv4/tcp_wmem
• /proc/sys/net/ipv4/tcp_rmem
internode_recv_buff_size_in_bytes
(Default: N/A) Sets the receiving socket buffer size in bytes for inter-node calls.
internode_compression
(Default: all) Controls whether traffic between nodes is compressed. The valid values are:
• all
All traffic is compressed.
• dc
Traffic between data centers is compressed.
• none
No compression.
inter_dc_tcp_nodelay
(Default: false) Enable or disable tcp_nodelay for inter-data center communication. When disabled larger,
but fewer, network packets are sent. This reduces overhead from the TCP protocol itself. However, if cross
data-center responses are blocked, it will increase latency.
streaming_socket_timeout_in_ms
(Default: 0 - never timeout streams) Enable or disable socket timeout for streaming operations. When
a timeout occurs during streaming, streaming is retried from the start of the current file. Avoid setting this
value too low, as it can result in a significant amount of data re-streaming.
Native transport (CQL Binary Protocol)
start_native_transport
(Default: true) Enable or disable the native transport server. Uses the same address as the rpc_address,
but the port is different from the rpc_port. See native_transport_port.
native_transport_port
(Default: 9042) Port on which the CQL native transport listens for clients.
native_transport_max_threads
(Default: 128) The maximum number of threads handling requests. Similar to rpc_max_threads, but differs
as follows:
• Default is different (128 versus unlimited).
• No corresponding native_transport_min_threads.
• Idle threads are stopped after 30 seconds.
native_transport_max_frame_size_in_mb
(Default: 256MB) The maximum allowed frame size. Frames (requests) larger than this are rejected as
invalid.
RPC (remote procedure call) settings
Settings for configuring and tuning client connections.
broadcast_rpc_address
(Default: unset) RPC address to broadcast to drivers and other Cassandra nodes. This cannot be set to
0.0.0.0. If blank, it is set to the value of rpc_address or rpc_interface. If rpc_address or rpc_interface is
set to 0.0.0.0, this property must be set.
rpc_port
(Default: 9160) Thrift port for client connections.
start_rpc
(Default: true) Starts the Thrift RPC server.
rpc_keepalive
(Default: true) Enable or disable keepalive on client connections (RPC or native).
rpc_max_threads
(Default: unlimited) Regardless of your choice of RPC server (rpc_server_type), the number of maximum
requests in the RPC thread pool dictates how many concurrent requests are possible. However, if you are
using the parameter sync in the rpc_server_type, it also dictates the number of clients that can be connected.
For a large number of client connections, this could cause excessive memory usage for the thread stack.
Connection pooling on the client side is highly recommended. Setting a maximum thread pool size acts as a
safeguard against misbehaved clients. If the maximum is reached, Cassandra blocks additional connections
until a client disconnects.
rpc_min_threads
(Default: 16) Sets the minimum thread pool size for remote procedure calls.
rpc_recv_buff_size_in_bytes
(Default: N/A) Sets the receiving socket buffer size for remote procedure calls.
rpc_send_buff_size_in_bytes
(Default: N/A) Sets the sending socket buffer size in bytes for remote procedure calls.
rpc_server_type
(Default: sync) Cassandra provides three options for the RPC server. On Windows, sync is about 30%
slower than hsha. On Linux, sync and hsha performance is about the same, but hsha uses less memory.
• sync: (Default) One thread per Thrift connection.
For a very large number of clients, memory is the limiting factor. On a 64-bit JVM, 180KB is the minimum
stack size per thread and corresponds to your use of virtual memory. Physical memory may be limited
depending on use of stack space.
• hsha:
Half synchronous, half asynchronous. All Thrift clients are handled asynchronously using a small number
of threads that does not vary with the number of clients and thus scales well to many clients. The RPC
requests are synchronous (one thread per active request).
Note: When selecting this option, you must change the default value (unlimited) of
rpc_max_threads.
• Your own RPC server
You must provide a fully-qualified class name of an o.a.c.t.TServerFactory that can create a
server instance.
Advanced fault detection settings
Settings to handle poorly performing or failing nodes.
dynamic_snitch_badness_threshold
(Default: 0.1) Sets the performance threshold for dynamically routing client requests away from a poorly
performing node. Specifically, it controls how much worse a poorly performing node has to be before the
dynamic snitch prefers other replicas over it. A value of 0.2 means Cassandra continues to prefer the static
snitch values until the node response time is 20% worse than the best performing node. Until the threshold
is reached, incoming requests are statically routed to the closest replica (as determined by the snitch). If
the value of this parameter is greater than zero and read_repair_chance is less than 1.0, cache capacity
is maximized across the nodes.
dynamic_snitch_reset_interval_in_ms
(Default: 600000 milliseconds) Time interval to reset all node scores, which allows a bad node to recover.
dynamic_snitch_update_interval_in_ms
(Default: 100 milliseconds) The time interval for how often the snitch calculates node scores. Because score
calculation is CPU intensive, be careful when reducing this interval.
hinted_handoff_enabled
(Default: true) Enable or disable hinted handoff. To enable per data center, add data center list. For
example: hinted_handoff_enabled: DC1,DC2. A hint indicates that the write needs to be replayed to
an unavailable node. Where Cassandra writes the hint depends on the version:
• Prior to 1.0
Writes to a live replica node.
• 1.0 and later
Writes to the coordinator node.
Related information: About hinted handoff writes
hinted_handoff_throttle_in_kb
(Default: 1024) Maximum throttle per delivery thread in kilobytes per second. This rate reduces proportionally
to the number of nodes in the cluster. For example, if there are two nodes in the cluster, each delivery thread
will use the maximum rate. If there are three, each node will throttle to half of the maximum, since the two
nodes are expected to deliver hints simultaneously.
max_hint_window_in_ms
(Default: 10800000 milliseconds [3 hours]) Maximum amount of time that Cassandra generates hints for an
unresponsive node. After this interval, new hints are no longer generated until the node is back up and
responsive. If the node goes down again, a new interval begins. This setting can prevent a sudden demand
for resources when a node is brought back online and the rest of the cluster attempts to replay a large
volume of hinted writes.
Related information: Failure detection and recovery
max_hints_delivery_threads
(Default: 2) Number of threads with which to deliver hints. In multiple data-center deployments, consider
increasing this number because cross data-center handoff is generally slower.
batchlog_replay_throttle_in_kb
(Default: 1024KB per second) Total maximum throttle. Throttling is reduced proportionally to the number
of nodes in the cluster.
Request scheduler properties
Settings to handle incoming client requests according to a defined policy. If you need to use these
properties, your nodes are overloaded and dropping requests. It is recommended that you add more nodes
and not try to prioritize requests.
request_scheduler
(Default: org.apache.cassandra.scheduler.NoScheduler) Defines a scheduler to handle incoming
client requests according to a defined policy. This scheduler is useful for throttling client requests in single
clusters containing multiple keyspaces. This parameter is specifically for requests from the client and does
not affect inter-node communication. Valid values are:
• org.apache.cassandra.scheduler.NoScheduler
No scheduling takes place.
• org.apache.cassandra.scheduler.RoundRobinScheduler
Round robin of client requests to a node with a separate queue for each request_scheduler_id property.
• A Java class that implements the RequestScheduler interface.
request_scheduler_id
(Default: keyspace) An identifier on which to perform request scheduling. Currently the only valid value
is keyspace. See weights.
request_scheduler_options
(Default: disabled) Contains a list of properties that define configuration options for request_scheduler:
• throttle_limit
The number of in-flight requests per client. Requests beyond this limit are queued up until running
requests complete. Recommended value is ((concurrent_reads + concurrent_writes) × 2).
• default_weight: (Default: 1)
How many requests are handled during each turn of the RoundRobin.
• weights: (Default: Keyspace: 1)
Takes a list of keyspaces. It sets how many requests are handled during each turn of the RoundRobin,
based on the request_scheduler_id.
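For example, a cassandra.yaml sketch that round-robins client requests and weights one keyspace more
heavily; the keyspace names and values are placeholders:
request_scheduler: org.apache.cassandra.scheduler.RoundRobinScheduler
request_scheduler_options:
    throttle_limit: 80
    default_weight: 5
    weights:
        Keyspace1: 1
        Keyspace2: 5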
Thrift interface properties
Legacy API for older clients. CQL is a simpler and better API for Cassandra.
thrift_framed_transport_size_in_mb
(Default: 15) Frame size (maximum field length) for Thrift. The frame is the row or part of the row that the
application is inserting.
thrift_max_message_length_in_mb
(Default: 16) The maximum length of a Thrift message in megabytes, including all fields and internal Thrift
overhead (1 byte of overhead for each frame). Message length is usually used in conjunction with batches.
A frame length greater than or equal to 24 accommodates a batch with four inserts, each of which is 24
bytes. The required message length is greater than or equal to 24+24+24+24+4 (number of frames).
Security properties
Server and client security settings.
authenticator
(Default: AllowAllAuthenticator) The authentication backend. It implements IAuthenticator for
identifying users. The available authenticators are:
• AllowAllAuthenticator:
Disables authentication; no checks are performed.
• PasswordAuthenticator
Authenticates users with user names and hashed passwords stored in the system_auth.credentials table.
If you use the default replication factor of 1 for the system_auth keyspace and the node with the lone
replica goes down, you will not be able to log into the cluster because the system_auth keyspace was not
replicated.
Related information: Internal authentication
internode_authenticator
(Default: enabled) Internode authentication backend. It implements
org.apache.cassandra.auth.AllowAllInternodeAuthenticator to allow or disallow connections
from peer nodes.
authorizer
(Default: AllowAllAuthorizer) The authorization backend. It implements IAuthorizer to limit access
and provide permissions. The available authorizers are:
• AllowAllAuthorizer
Disables authorization; allows any action to any user.
• CassandraAuthorizer
Stores permissions in the system_auth.permissions table. If you use the default replication factor of 1 for
the system_auth keyspace and the node with the lone replica goes down, you will not be able to log into
the cluster because the system_auth keyspace was not replicated.
Related information: Object permissions
permissions_validity_in_ms
(Default: 2000) How long permissions in cache remain valid. Depending on the authorizer, such as
CassandraAuthorizer, fetching permissions can be resource intensive. The cache is disabled when this
setting is 0 or when AllowAllAuthorizer is used.
Related information: Object permissions
permissions_update_interval_in_ms
(Default: same value as permissions_validity_in_ms) Refresh interval for the permissions cache (if enabled).
After this interval, cache entries become eligible for refresh. On the next access, an asynchronous reload
is scheduled and the old value is returned until it completes. If permissions_validity_in_ms is non-zero,
this property must also be non-zero.
server_encryption_options
Enable or disable inter-node encryption. You must also generate keys and provide the appropriate key and
trust store locations and passwords. No custom encryption options are currently enabled. The available
options are:
• keystore_password: <keystore_password>
Password for the keystore. This must match the password used when generating the keystore and
truststore.
• require_client_auth: (Default: false)
Enables or disables certificate authentication. (Available starting with Cassandra 1.2.3.)
• truststore: (Default: conf/.truststore)
Set if require_client_auth is true.
• truststore_password: <truststore_password>
Set if require_client_auth is true.
The advanced settings are:
• protocol: (Default: TLS)
• algorithm: (Default: SunX509)
• store_type: (Default: JKS)
• cipher_suites: (Default:
TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA)
Related information: Client-to-node encryption
ssl_storage_port
(Default: 7001) The SSL port for encrypted communication. Unused unless enabled in encryption_options.
Procedure
In the cassandra.yaml file, set the following parameters:
Property Description
cluster_name Name of the cluster that this node is joining. Must
be the same for every node in the cluster.
listen_address The IP address or hostname that Cassandra binds
to for connecting to other Cassandra nodes.
(Optional) broadcast_address The IP address a node tells other nodes in the
cluster to contact it by. It allows public and private
address to be different. For example, use the
broadcast_address parameter in topologies
where not all nodes have access to other nodes
by their private IP addresses. The default is the
listen_address.
seed_provider A -seeds list is a comma-delimited list of hosts
(IP addresses) that gossip uses to learn the
topology of the ring. Every node should have
the same list of seeds. In multiple data-center
clusters, the seed list should include at least one
node from each data center (replication group).
More than a single seed node per data center
is recommended for fault tolerance. Otherwise,
gossip has to communicate with another data
center when bootstrapping a node. Making every
node a seed node is not recommended because
of increased maintenance and reduced gossip
performance. Gossip optimization is not critical,
but it is recommended to use a small seed list
(approximately three nodes per data center).
storage_port The inter-node communication port (default is
7000). Must be the same for every node in the
cluster.
initial_token For legacy clusters. Used in the single-node-per-
token architecture, where a node owns exactly
one contiguous range in the ring space.
num_tokens For new clusters. Defines the number of tokens
randomly assigned to this node on the ring when
using virtual nodes (vnodes).
Base the size of the directory on the value of the Java -mx option.
Procedure
Set the location of the heap dump in the cassandra-env.sh file.
1. Open the cassandra-env.sh file for editing.
2. On the line after the comment, set the CASSANDRA_HEAPDUMP_DIR to the path you want to use:
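For example (the path is a placeholder; choose a directory with enough free space to hold a full heap dump):
CASSANDRA_HEAPDUMP_DIR=/var/lib/cassandra/heapdump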
Procedure
Set the number of tokens on each node in your cluster with the num_tokens parameter in the
cassandra.yaml file.
The recommended value is 256. Do not set the initial_token parameter.
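For example, in the cassandra.yaml file on each node:
num_tokens: 256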
Procedure
1. Add a new data center to the cluster.
2. Once the new data center with vnodes enabled is up, switch your clients to use the new data center.
3. Run a full repair with nodetool repair.
This step ensures that after you move the client to the new data center that any previous writes are
added to the new data center and that nothing else, such as hints, is dropped when you remove the old
data center.
4. Update your schema to no longer reference the old data center.
5. Remove the old data center from the cluster.
See Decommissioning a data center.
Note: Do not make all nodes seeds, see Internode communications (gossip).
3. Be sure that the storage_port or ssl_storage_port is open on the public IP firewall.
Caution: Be sure to enable encryption and authentication when using public IP's. See Node-to-
node encryption. Another option is to use a custom VPN to have local, inter-region/datacenter IP's.
In the example below, there are two Cassandra data centers and each data center is named for its
workload. You can use other naming conventions, such as DC1, DC2 or 100, 200. (Data center names are
case-sensitive.)
The node and data center assignments are the same on Network A and Network B:
• node0: dc=DC_A_cassandra, rack=RAC1
• node1: dc=DC_A_cassandra, rack=RAC1
• node2: dc=DC_B_cassandra, rack=RAC1
• node3: dc=DC_B_cassandra, rack=RAC1
• node4: dc=DC_A_analytics, rack=RAC1
• node5: dc=DC_A_search, rack=RAC1
us-east_1_cassandra us-west_1_cassandra
us-east_2_cassandra us-west_2_cassandra
us-east_1_analytics us-west_1_analytics
us-east_1_search us-west_1_search
Configuring logging
About Cassandra logging functionality using Simple Logging Facade for Java (SLF4J) with a logback
backend.
Cassandra provides logging functionality using Simple Logging Facade for Java (SLF4J) with a logback
backend. Logs are written to the system.log file in the Cassandra logging directory. You can configure
logging programmatically or manually. Manual ways to configure logging are:
• Run the nodetool setlogginglevel command
• Configure the logback-test.xml or logback.xml file installed with Cassandra
• Use the JConsole tool to configure logging through JMX.
Logback looks for logback-test.xml first, and then for logback.xml. The logback.xml location for
different types of installations is listed in the "File locations" section. For example, on tarball and source
installations, logback.xml is located in the install_location/conf directory.
The XML configuration files look something like this:
<configuration scan="true">
  <jmxConfigurator />
  <appender name="FILE"
            class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>${cassandra.logdir}/system.log</file>
    <rollingPolicy
            class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
      <fileNamePattern>${cassandra.logdir}/system.log.%i.zip</fileNamePattern>
      <minIndex>1</minIndex>
      <maxIndex>20</maxIndex>
    </rollingPolicy>
    <triggeringPolicy
            class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
      <maxFileSize>20MB</maxFileSize>
    </triggeringPolicy>
    <encoder>
      <pattern>%-5level [%thread] %date{ISO8601} %F:%L - %msg%n</pattern>
      <!-- old-style log format
      <pattern>%5level [%thread] %date{ISO8601} %F (line %L) %msg%n</pattern>
      -->
    </encoder>
  </appender>

  <root level="INFO">
    <appender-ref ref="FILE" />
    <appender-ref ref="STDOUT" />
  </root>
The appender configurations specify where to print the log and its configuration. The first appender directs
logs to a file. The second appender directs logs to the console. You can change the following logging
functionality:
• Rolling policy
• The policy for rolling logs over to an archive
• Location and name of the log file
• Location and name of the archive
• Minimum and maximum file size to trigger rolling
• Format of the message
• The log level
Log levels
The valid values for setting the log level include ALL for logging information at all levels, TRACE through
ERROR, and OFF for no logging. TRACE creates the most verbose log, and ERROR, the least.
• ALL
• TRACE
• DEBUG
• INFO (Default)
• WARN
• ERROR
• OFF
Note: Increasing logging levels can generate heavy logging output on a moderately trafficked
cluster.
You can use the nodetool getlogginglevels command to see the current logging configuration.
$ nodetool getlogginglevels
Logger Name Log Level
ROOT INFO
com.thinkaurelius.thrift ERROR
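To change a logging level at runtime, use the nodetool setlogginglevel command; the logger name below
is only an example:
$ nodetool setlogginglevel org.apache.cassandra.db DEBUG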
Procedure
• Archive a commitlog segment:
Command: archive_command=
Parameters: %path - Fully qualified path of the segment to archive.
%name - Name of the commit log.
Example: archive_command=/bin/ln %path /backup/%name
• Restore an archived commitlog:
Command: restore_command=
Parameters: %from - Fully qualified path of an archived commitlog segment from the restore_directories.
%to - Name of the live commit log directory.
Example: restore_command=cp -f %from %to
• Set the restore directory location:
Command: restore_directories=
Format: restore_directories=restore_directory_location
• Restore mutations created up to and including the specified timestamp:
Command: restore_point_in_time=
Format: <timestamp> (YYYY:MM:DD HH:MM:SS)
Example: restore_point_in_time=2013:12:11 17:00:00
Restore stops when the first client-supplied timestamp is greater than the restore point timestamp.
Because the order in which Cassandra receives mutations does not strictly follow the timestamp order,
this can leave some mutations unrecovered.
Generating tokens
If not using virtual nodes (vnodes), you must calculate tokens for your cluster.
If not using virtual nodes (vnodes), you must calculate tokens for your cluster.
The following topics in the Cassandra 1.1 documentation provide conceptual information about tokens:
• Data Distribution in the Ring
• Replication Strategy
About calculating tokens for single or multiple data centers in Cassandra 1.2 and later
• Single data center deployments: calculate tokens by dividing the hash range by the number of nodes in
the cluster.
• Multiple data center deployments: calculate the tokens for each data center so that the hash range is
evenly divided for the nodes in each data center.
For more explanation, be sure to read the conceptual information mentioned above.
The method used for calculating tokens depends on the type of partitioner:
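For example, for the default Murmur3Partitioner, a common approach is to divide the full token range
evenly among the nodes. The following sketch computes initial tokens for a hypothetical four-node cluster:
$ python -c 'print( [str(((2**64 // 4) * i) - 2**63) for i in range(4)] )'
['-9223372036854775808', '-4611686018427387904', '0', '4611686018427387904']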
Hadoop support
Cassandra support for integrating Hadoop with Cassandra.
Cassandra support for integrating Hadoop with Cassandra includes:
• MapReduce
• Apache Pig
You can use Cassandra 2.1 with Hadoop 2.x or 1.x with some restrictions.
• Isolate Cassandra and Hadoop nodes in separate data centers.
• Before starting the data centers of Cassandra/Hadoop nodes, disable virtual nodes (vnodes).
To disable virtual nodes:
1. In the cassandra.yaml file, set num_tokens to 1.
2. Uncomment the initial_token property and set it to 1 or to the value of a generated token for a multi-
node cluster.
3. Start the cluster for the first time.
Do not disable or enable vnodes on an existing cluster.
Setup and configuration, described in the Apache docs, involves overlaying a Hadoop cluster on
Cassandra nodes, configuring a separate server for the Hadoop NameNode/JobTracker, and installing
a Hadoop TaskTracker and Data Node on each Cassandra node. The nodes in the Cassandra data
center can draw from data in the HDFS Data Node as well as from Cassandra. The Job Tracker/Resource
Manager (JT/RM) receives MapReduce input from the client application. The JT/RM sends a MapReduce
job request to the Task Trackers/Node Managers (TT/NM) and optional clients, MapReduce and Pig. The
data is written to Cassandra and results sent back to the client.
The Apache docs also cover how to get configuration and integration support.
Operations
Operation topics.
Monitoring Cassandra
Monitoring topics.
$ nodetool proxyhistograms
proxy histograms
Percentile Read Latency Write Latency Range Latency
(micros) (micros) (micros)
50% 1502.50 375.00 446.00
75% 1714.75 420.00 498.00
95% 31210.25 507.00 800.20
98% 36365.00 577.36 948.40
99% 36365.00 740.60 1024.39
Min 616.00 230.00 311.00
Max 36365.00 55726.00 59247.00
For a summary of the ring and its current state of general health, use the status command. For example:
$ nodetool status
The nodetool utility provides commands for viewing detailed metrics for tables, server metrics, and
compaction statistics:
• nodetool cfstats displays statistics for each table and keyspace.
• nodetool cfhistograms provides statistics about a table, including read/write latency, row size,
column count, and number of SSTables.
• nodetool netstats provides statistics about network operations and connections.
• nodetool tpstats provides statistics about the number of active, pending, and completed tasks for
each stage of Cassandra operations by thread pool.
DataStax OpsCenter
DataStax OpsCenter is a graphical user interface for monitoring and administering all nodes in a
Cassandra cluster from one centralized console. DataStax OpsCenter is bundled with DataStax support
offerings. You can register for a free version for development or non-production use.
OpsCenter provides a graphical representation of performance trends in a summary view that is hard
to obtain with other monitoring tools. The GUI provides views for different time periods as well as
the capability to drill down on single data points. Both real-time and historical performance data for a
Cassandra or DataStax Enterprise cluster are available in OpsCenter. OpsCenter metrics are captured and
stored within Cassandra.
Within OpsCenter you can customize the performance metrics viewed to meet your monitoring needs.
Administrators can also perform routine node administration tasks from OpsCenter. Metrics within
OpsCenter are divided into three general categories: table metrics, cluster metrics, and OS metrics. For
many of the available metrics, you can view aggregated cluster-wide information or view information on a
per-node basis.
If you choose to monitor Cassandra using JConsole, keep in mind that JConsole consumes a significant
amount of system resources. For this reason, DataStax recommends running JConsole on a remote
machine rather than on the same host as a Cassandra node.
The JConsole CompactionManagerMBean exposes compaction metrics that can indicate when you need
to add capacity to your cluster.
Compaction metrics
Monitoring compaction performance is an important aspect of knowing when to add capacity to your
cluster.
Monitoring compaction performance is an important aspect of knowing when to add capacity to your
cluster. The following attributes are exposed through CompactionManagerMBean:
• CompletedTasks: Number of completed compactions since the last start of this Cassandra instance.
• PendingTasks: Number of estimated tasks remaining to perform.
• ColumnFamilyInProgress: The table currently being compacted. This attribute is null if no compactions are in progress.
• BytesTotalInProgress: Total number of data bytes (index and filter are not included) being compacted. This attribute is null if no compactions are in progress.
• BytesCompacted: The progress of the current compaction. This attribute is null if no compactions are in progress.
Table statistics
Compaction metrics provide a number of statistics that are important for monitoring performance trends.
For individual tables, ColumnFamilyStoreMBean provides the same general latency attributes as
StorageProxyMBean. Unlike StorageProxyMBean, ColumnFamilyStoreMBean has a number of other
statistics that are important to monitor for performance trends. The most important of these are:
• MemtableDataSize: The total size consumed by this table's data (not including metadata).
• MemtableColumnsCount: Returns the total number of columns present in the memtable (across all keys).
• MemtableSwitchCount: How many times the memtable has been flushed out.
• RecentReadLatencyMicros: The average read latency since the last call to this bean.
• RecentWriteLatencyMicros: The average write latency since the last call to this bean.
• LiveSSTableCount: The number of live SSTables for this table.
The recent read latency and write latency counters are important in making sure operations are happening
in a consistent manner. If these counters start to increase after a period of staying flat, you probably need
to add capacity to the cluster.
You can set a threshold and monitor LiveSSTableCount to ensure that the number of SSTables for a given
table does not become too great.
After updating the value of bloom_filter_fp_chance on a table, Bloom filters need to be regenerated in one
of these ways:
• Initiate compaction
• Upgrade SSTables
You do not have to restart Cassandra after regenerating SSTables.
Data caching
Data caching topics.
In Cassandra 2.1, the saved key cache files include the ID of the table in the file name. A saved key cache
file name for the users table in the mykeyspace keyspace in a Cassandra 2.1 installation looks something like this:
mykeyspace-users.users_name_idx-19bd7f80352c11e4aa6a57448213f97f-KeyCache-
b.db2046071785672832311.tmp
You can configure partial or full caching of each partition by setting the rows_per_partition table option.
Previously, the caching mechanism put the entire partition in memory. If the partition was larger than the
cache size, Cassandra never read the data from the cache. Now, you can specify the number of rows to
cache per partition to increase cache hits. You configure the cache using the CQL caching property.
Procedure
Set the table caching property that configures the partition key cache and the row cache.
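For example, assuming a table named mykeyspace.users, a sketch of the CQL statement might be:
ALTER TABLE mykeyspace.users
WITH caching = { 'keys' : 'ALL', 'rows_per_partition' : '120' };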
One read operation hits the row cache, returning the requested row without a disk seek. The other read
operation requests a row that is not present in the row cache but is present in the partition key cache. After
accessing the row in the SSTable, the system returns the data and populates the row cache with this read
operation.
In subsequent queries for the same partition, look for a line in the trace that looks something like this:
This output means the data was found in the cache and no disk read occurred. Updates invalidate the
cache. If you query rows in the cache plus uncached rows, request more rows than the global limit allows,
or the query does not grab the beginning of the partition, the trace might include a line that looks something
like this:
Ignoring row cache as cached value could not satisfy query [ReadStage:89]
This output indicates that an insufficient cache caused a disk read. Requesting rows not at the beginning
of the partition is a likely cause. Try removing constraints that might cause the query to skip the beginning
of the partition, or place a limit on the query to prevent results from overflowing the cache. To ensure that
the query hits the cache, try increasing the cache size limit, or restructure the table to position frequently
accessed rows at the head of the partition.
ID : 387d15ba-7103-491b-9327-1a691dbb504a
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 65.87 KB
Generation No : 1400189757
Uptime (seconds) : 148760
Heap Memory (MB) : 392.82 / 1996.81
Data Center : datacenter1
Rack : rack1
Exceptions : 0
Key Cache : entries 10, size 728 (bytes), capacity 103809024 (bytes),
93 hits, 102 requests, 0.912 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 (bytes), capacity 0 (bytes), 0 hits, 0
requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 (bytes), capacity 51380224 (bytes), 0
hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token : -9223372036854775808
Configuring compaction
Steps for configuring compaction. The compaction process merges keys, combines columns, evicts
tombstones, consolidates SSTables, and creates a new index in the merged SSTable.
store data that has been set to expire using TTL in an SSTable with other data scheduled to expire at
approximately the same time. Cassandra can then drop the SSTable without doing any compaction.
Also see DTCS compaction subproperties and DateTieredCompactionStrategy: Compaction for Time
Series Data.
Note: When using DTCS disabling read repair is recommended. Use full repair as necessary.
• LeveledCompactionStrategy (LCS): The leveled compaction strategy creates SSTables of a
fixed, relatively small size (160 MB by default) that are grouped into levels. Within each level, SSTables
are guaranteed to be non-overlapping. Each level (L0, L1, L2 and so on) is 10 times as large as the
previous. Disk I/O is more uniform and predictable on higher than on lower levels as SSTables are
continuously being compacted into progressively larger levels. At each level, row keys are merged
into non-overlapping SSTables. This can improve performance for reads, because Cassandra can
determine which SSTables in each level to check for the existence of row key data. This compaction
strategy is modeled after Google's leveldb implementation. Also see LCS compaction subproperties.
To configure the compaction strategy property and CQL compaction subproperties, such as the maximum
number of SSTables to compact and minimum SSTable size, use CREATE TABLE or ALTER TABLE.
The location of the cassandra.yaml file depends on the type of installation:
Procedure
1. Update a table to set the compaction strategy using the ALTER TABLE statement.
2. Change the compaction strategy property to SizeTieredCompactionStrategy and specify the minimum
number of SSTables to trigger a compaction using the CQL min_threshold attribute.
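For example, assuming a table named mykeyspace.users, a sketch combining steps 1 and 2 might be:
ALTER TABLE mykeyspace.users
WITH compaction = { 'class' : 'SizeTieredCompactionStrategy', 'min_threshold' : 6 };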
Results
You can monitor the results of your configuration using compaction metrics, see Compaction metrics.
Compression
Compression maximizes the storage capacity of Cassandra nodes by reducing the volume of data on disk
and disk I/O, particularly for read-dominated workloads.
Compression maximizes the storage capacity of Cassandra nodes by reducing the volume of data on disk
and disk I/O, particularly for read-dominated workloads. Cassandra quickly finds the location of rows in the
SSTable index and decompresses the relevant row chunks.
Write performance is not negatively impacted by compression in Cassandra as it is in traditional
databases. In traditional relational databases, writes require overwrites to existing data files on disk. The
database has to locate the relevant pages on disk, decompress them, overwrite the relevant data, and
finally recompress. In a relational database, compression is an expensive operation in terms of CPU cycles
and disk I/O. Because Cassandra SSTable data files are immutable (they are not written to again after
they have been flushed to disk), there is no recompression cycle necessary in order to process writes.
SSTables are compressed only once when they are written to disk. Writes on compressed tables can show
up to a 10 percent performance improvement.
Configuring compression
Steps for configuring compression.
Procedure
1. Disable compression, using CQL to set the compression parameters to an empty string.
2. Enable compression on an existing table, using ALTER TABLE to set the compression algorithm
sstable_compression to LZ4Compressor (Cassandra 1.2.2 and later), SnappyCompressor, or
DeflateCompressor.
3. Change compression on an existing table, using ALTER TABLE and setting the compression algorithm
sstable_compression to DeflateCompressor.
You tune data compression on a per-table basis using CQL to alter a table.
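A minimal sketch of the three steps, assuming a table named users:
-- 1. Disable compression
ALTER TABLE users WITH compression = { 'sstable_compression' : '' };
-- 2. Enable LZ4 compression
ALTER TABLE users WITH compression = { 'sstable_compression' : 'LZ4Compressor' };
-- 3. Change the compression algorithm to Deflate
ALTER TABLE users WITH compression = { 'sstable_compression' : 'DeflateCompressor' };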
Procedure
Enable write survey mode by starting a Cassandra node using the write_survey option.
$ bin/cassandra -Dcassandra.write_survey=true
This example shows how to start a tarball installation of Cassandra.
• MAX_HEAP_SIZE
Sets the maximum heap size for the JVM. The same value is also used for the minimum heap size. This
allows the heap to be locked in memory at process start to keep it from being swapped out by the OS.
• HEAP_NEWSIZE
The size of the young generation. The larger this is, the longer GC pause times will be. The shorter it is,
the more expensive GC will be (usually). A good guideline is 100 MB per CPU core.
Many users new to Cassandra are tempted to turn up Java heap size too high, which consumes the
majority of the underlying system's RAM. In most cases, increasing the Java heap size is actually
detrimental for these reasons:
• In most cases, the capability of Java to gracefully handle garbage collection above 8GB quickly
diminishes.
• Modern operating systems maintain the OS page cache for frequently accessed data and are very good at keeping this data in memory, but an elevated Java heap size can prevent them from doing this job.
If you have more than 2GB of system memory, which is typical, keep the size of the Java heap relatively
small to allow more memory for the page cache.
Some Solr users have reported that increasing the stack size improves performance under Tomcat. To
increase the stack size, uncomment and modify the default setting in the cassandra-env.sh file. Also,
decreasing the memtable space to make room for Solr caches might improve performance. Modify the
memtable space using the memtable_total_space_in_mb property in the cassandra.yaml file.
Because MapReduce runs outside the JVM, changes to the JVM do not affect Analytics/Hadoop
operations directly.
JMX options
Cassandra exposes a number of statistics and management operations via Java Management Extensions
(JMX). Java Management Extensions (JMX) is a Java technology that supplies tools for managing and
monitoring Java applications and services. Any statistic or operation that a Java application has exposed
as an MBean can then be monitored or manipulated using JMX. JConsole, the nodetool utility, and
DataStax OpsCenter are examples of JMX-compliant management tools.
By default, JMX listens on port 7199 without authentication. You can modify the following properties in the conf/cassandra-env.sh file to configure JMX:
• com.sun.management.jmxremote.port
The port on which Cassandra listens for JMX connections.
• com.sun.management.jmxremote.ssl
Enable/disable SSL for JMX.
• com.sun.management.jmxremote.authenticate
Enable/disable remote authentication for JMX.
• -Djava.rmi.server.hostname
Sets the interface hostname or IP that JMX should use to connect. Uncomment and set if you are
having trouble connecting.
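The corresponding defaults appear in conf/cassandra-env.sh roughly as follows (an illustrative excerpt; exact lines vary by release):
JMX_PORT="7199"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.port=$JMX_PORT"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.ssl=false"
JVM_OPTS="$JVM_OPTS -Dcom.sun.management.jmxremote.authenticate=false"
# JVM_OPTS="$JVM_OPTS -Djava.rmi.server.hostname=public_name_or_ip"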
The location of the cassandra.yaml file depends on the type of installation:
Procedure
In the unlikely event you need to correct a problem in the gossip state:
1. Using MX4J or JConsole, connect to the node's JMX port and then use the JMX method
Gossiper.unsafeAssassinateEndpoints(ip_address) to assassinate the problem node.
This takes a few seconds to complete so wait for confirmation that the node is deleted.
2. If the JMX method above doesn't solve the problem, stop your client application from sending writes to
the cluster.
3. Take the entire cluster offline:
$ sudo rm -r /var/lib/cassandra/data/system/peers/*
5. Clear the gossip state when the node starts:
• For tarball installations, you can use a command line option or edit the cassandra-env.sh. To use
the command line:
$ install_location/bin/cassandra -Dcassandra.load_ring_state=false
• For package installations or if you are not using the command line option above, add the following
line to the cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
$ cd install_location
$ bin/cassandra
What to do next
Remove the line you added in the cassandra-env.sh file.
Repairing nodes
Node repair makes data on a replica consistent with data on other nodes. Anti-entropy node repairs are
important for every Cassandra cluster. Frequent data deletions and a node going down are common
causes of inconsistency. You use the nodetool repair command to repair a node. There are several types
of node repair:
• Sequential - One node is repaired after another, and done in full, all SSTables are repaired (default).
• Incremental - Persists already repaired data, which allows the repair process to stay performant and lightweight as datasets grow, provided that repairs are run frequently.
• Partitioner range - Repairs only the first range returned by the partitioner for a node. This repair type operates on each node in the cluster in succession without duplicating work.
You can combine repair options, such as parallel and incremental repair. This combination does an
incremental repair to all nodes at the same time. You can restrict repairs to local or other data centers or to
nodes between a certain token range. You can specify which nodes have the good data for replacing the
outdated data.
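For example, a parallel incremental repair restricted to the local data center might look like this (the keyspace name is a placeholder):
$ nodetool repair -par -inc -local my_keyspace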
Incremental repair
The repair process involves computing a Merkle tree for each range of data on that node. The Merkle tree
is a binary tree of hashes used by Cassandra for calculating the differences in datasets between nodes in
a cluster. Each node involved in the repair has to construct its Merkle tree from all the SSTables it stores,
making the calculation resource intensive.
To reduce the expense of constructing trees, Cassandra 2.1 introduces incremental repair. An incremental
repair makes already repaired data persistent, and only calculates a Merkle tree for unrepaired SSTables.
Reducing the size of the Merkle tree improves the performance of the incremental repair process,
assuming repairs are run frequently.
Incremental repairs begin with the repair leader sending out a prepare message to its peers. Each node
builds a Merkle tree from the unrepaired SSTables. After the leader receives a Merkle tree from each node,
the leader compares the trees and issues streaming requests. Finally, the leader issues an anticompaction
command.
For more information about incremental repairs, see the "More efficient repairs" article.
Anticompaction
Anticompaction in Cassandra 2.1 is the process of segregating repaired and unrepaired ranges into
separate SSTables unless the SSTable fits entirely within the repaired range. If the SSTable fits within the
repaired range, Cassandra just updates the SSTable metadata.
Anticompaction occurs after an incremental repair. Cassandra performs anticompaction only on the
SSTables that have a range of unrepaired data. If all node ranges are repaired, anticompaction does not
need to rewrite any data. During anticompaction, size/date-tiered compaction and leveled compaction
handle the segregation of the data differently.
• Size-tiered and date-tiered compaction split repaired and unrepaired data into separate pools for separate
compactions. A major compaction generates two SSTables, one containing repaired data and one
containing unrepaired data.
• Leveled compaction performs size-tiered compaction on unrepaired data. After repair completes,
Cassandra moves data from the set of unrepaired SSTables to L0.
Repairing a subrange
To mitigate overstreaming, you can use subrange repair, but generally subrange repairs are not
recommended. A subrange repair does a portion of the data belonging to the node. Because the Merkle
tree precision is fixed, this effectively increases the overall precision.
To use subrange repair:
1. Use the Java describe_splits call to ask for a split containing 32K partitions.
2. Iterate throughout the entire range incrementally or in parallel. This completely eliminates the
overstreaming behavior and wasted disk usage overhead.
3. Pass the tokens you received for the split to the nodetool repair -st (--start-token) and -et (--end-token) options.
4. Pass the -local (--in-local-dc) option to nodetool to repair only within the local data center. This reduces
the cross data-center transfer load.
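A sketch of one subrange repair call, using the start and end tokens returned for a single split (the token values and keyspace name are placeholders):
$ nodetool repair -st -9223372036854775808 -et -3074457345618258603 -local my_keyspace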
Repair guidelines
Run repair in these situations:
• Daily if you run incremental repairs, weekly if you run full repairs.
Note: Even if deletions never occur, schedule regular repairs. Setting a column to null is a
delete.
• During node recovery. For example, when bringing a node back into the cluster after a failure.
• On nodes containing data that is not read frequently.
• To update data on a node that has been down.
• To recover missing data or corrupted SSTables. A non-incremental repair is required.
• To minimize impact, do not invoke more than one repair at a time.
Guidelines for running routine node repair include:
• Schedule regular repair operations for low-usage hours.
• A full repair is recommended to eliminate the need for anticompaction.
• Migrating to incremental repairs is recommended if you use leveled compaction.
• The hard requirement for routine repair frequency is the value of gc_grace_seconds. Run a repair
operation at least once on each node within this time period. Following this important guideline ensures
that deletes are properly handled in the cluster.
• In systems that seldom delete or overwrite data, you can raise the value of gc_grace_seconds with
minimal impact to disk space. A higher value schedules wider intervals between repair operations.
• To mitigate heavy disk usage, configure nodetool compaction throttling options
(setcompactionthroughput and setcompactionthreshold) before running a repair.
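For example, to throttle compaction to 16 MB per second before starting a repair (the value is illustrative):
$ nodetool setcompactionthroughput 16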
In Cassandra 2.1.1, sstablerepairedset can take as arguments a list of SSTables on the command line or a file listing the SSTables, specified with the -f flag.
Note: In RHEL and Debian installations, you must install the tools packages.
This example shows how to use sstablerepairedset to clear the repaired state of an SSTable, rendering the
SSTable unrepaired.
1. Stop the node.
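A sketch of the remaining step, assuming the tool is run from the tools/bin directory and that sstable-list.txt names the SSTable data files to modify:
$ tools/bin/sstablerepairedset --is-unrepaired -f sstable-list.txt
Restart the node afterward so it picks up the changed metadata.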
Procedure
Be sure to install the same version of Cassandra as is installed on the other nodes in the cluster. See
Installing prior releases.
1. Install Cassandra on the new nodes, but do not start Cassandra.
If you used the Debian install, Cassandra starts automatically and you must stop the node and clear the
data.
2. Set the following properties in the cassandra.yaml and, depending on the snitch, the cassandra-
topology.properties or cassandra-rackdc.properties configuration files:
• auto_bootstrap - If this option has been set to false, you must set it to true. This option is not listed in
the default cassandra.yaml configuration file and defaults to true.
• cluster_name - The name of the cluster the new node is joining.
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"
Tarball installations
$ bin/cassandra -Dcassandra.consistent.rangemovement=false
Procedure
Be sure to install the same version of Cassandra as is installed on the other nodes in the cluster. See
Installing prior releases.
1. Ensure that you are using NetworkTopologyStrategy for all of your keyspaces.
2. For each node, set the following properties in the cassandra.yaml file:
a) Add (or edit) auto_bootstrap: false.
By default, this setting is true and not listed in the cassandra.yaml file. Setting this parameter to
false prevents the new nodes from attempting to get all the data from the other nodes in the data
center. When you run nodetool rebuild in the last step, each node is properly mapped.
b) Set other properties, such as -seeds and endpoint_snitch, to match the cluster settings.
For more guidance, see Initializing a multiple node cluster (multiple data centers).
Note: Do not make all nodes seeds; see Internode communications (gossip).
c) If you want to enable vnodes, set num_tokens.
The recommended value is 256. Do not set the initial_token parameter.
3. Update the relevant property file for the snitch used on all servers to include the new nodes. You do not
need to restart.
• GossipingPropertyFileSnitch: cassandra-rackdc.properties
• PropertyFileSnitch: cassandra-topology.properties
4. Ensure that your clients are configured correctly for the new cluster:
• If your client uses the DataStax Java, C#, or Python driver, set the load-balancing policy to
DCAwareRoundRobinPolicy (Java, C#, Python).
• If you are using another client such as Hector, make sure it does not auto-detect the new nodes so
that they aren't contacted by the client until explicitly directed. For example, if you are using Hector,
use hostConfig.setAutoDiscoverHosts(false);. If you are using Astyanax, use
ConnectionPoolConfigurationImpl.setLocalDatacenter("<data center name>") to
ensure you are connecting to the specified data center.
• If you are using Astyanax 2.x, with integration with the DataStax Java Driver 2.0,
you can set the load-balancing policy to DCAwareRoundRobinPolicy by calling
JavaDriverConfigBuilder.withLoadBalancingPolicy().
5. If using a QUORUM consistency level for reads or writes, check the LOCAL_QUORUM or
EACH_QUORUM consistency level to see if the level meets your requirements for multiple data
centers.
6. Start Cassandra on the new nodes.
7. After all nodes are running in the cluster:
a) Change the keyspace properties to specify the desired replication factor for the new data center.
For example, set strategy options to DC1:2, DC2:2.
For more information, see ALTER KEYSPACE.
b) Run nodetool rebuild specifying the existing data center on all nodes in the new data center:
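For example (the data center name is a placeholder for the name of an existing data center):
$ nodetool rebuild -- name_of_existing_data_center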
Otherwise, requests to the new data center with LOCAL_ONE or ONE consistency levels may fail if
the existing data centers are not completely in-sync.
You can run rebuild on one or more nodes at the same time. The choice depends on whether your cluster can handle the extra I/O and network pressure of running on multiple nodes. Running on one
node at a time has the least impact on the existing cluster.
Attention: If you don't specify the existing data center in the command line, the new nodes
will appear to rebuild successfully, but will not contain any data.
8. Change auto_bootstrap: false to true, or remove the setting, in the cassandra.yaml file.
This returns the parameter to its normal setting so that the nodes can get all the data from the other nodes in the data center if restarted.
Procedure
Be sure to install the same version of Cassandra as is installed on the other nodes in the cluster. See
Installing prior releases.
1. Confirm that the node is dead using nodetool status:
The nodetool command shows a down status for the dead node (DN):
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=address_of_dead_node"
• Tarball installations: Start Cassandra with this option:
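A sketch of the tarball start command, substituting the dead node's IP address:
$ sudo bin/cassandra -Dcassandra.replace_address=address_of_dead_node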
What to do next
• Remove the old node's IP address from the cassandra-topology.properties or cassandra-
rackdc.properties file.
Caution: Wait at least 72 hours to ensure that old node information is removed from gossip. If
removed from the property file too soon, problems may result.
• Remove the node.
The location of the cassandra-topology.properties file depends on the type of installation:
Procedure
Be sure to install the same version of Cassandra as is installed on the other nodes in the cluster. See
Installing prior releases.
1. Prepare and start the replacement node, as described in Adding nodes to an existing cluster.
Procedure
1. Make sure no clients are still writing to any nodes in the data center.
2. Run a full repair with nodetool repair.
This ensures that all data is propagated from the data center being decommissioned.
3. Change all keyspaces so they no longer reference the data center being removed.
4. Run nodetool decommission on every node in the data center being removed.
Removing a node
Reduce the size of a data center.
Procedure
1. Check whether the node is up or down using nodetool status:
The nodetool command shows the status of the node (UN=up, DN=down):
Switching snitches
Steps for switching snitches.
Procedure
1. Create a properties file with data center and rack information.
• cassandra-rackdc.properties
GossipingPropertyFileSnitch, Ec2Snitch, and Ec2MultiRegionSnitch only.
• cassandra-topology.properties
All other network snitches.
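For example, a minimal cassandra-rackdc.properties entry for GossipingPropertyFileSnitch (the data center and rack names are placeholders):
dc=DC1
rack=RAC1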
2. Copy the cassandra-rackdc.properties or cassandra-topology.properties file to the Cassandra
configuration directory on all the cluster's nodes. They won't be used until the new snitch is enabled.
The location of the cassandra-topology.properties file depends on the type of installation:
endpoint_snitch: GossipingPropertyFileSnitch
4. If the topology has not changed, you can restart each node one at a time.
Any change in the cassandra.yaml file requires a node restart.
5. If the topology of the network has changed:
a) Shut down all the nodes, then restart them.
b) Run a sequential repair and nodetool cleanup on each node.
Procedure
The following method ensures that if something goes wrong with the new cluster, you still have the existing
cluster until you no longer need it.
1. Set up and configure the new cluster as described in Provisioning a new cluster.
Note: If you're not using vnodes, be sure to configure the token ranges in the new nodes to
match the ranges in the old cluster.
2. Set up the schema for the new cluster using CQL.
3. Configure your client to write to both clusters.
Depending on how the writes are done, code changes may be needed. Be sure to use identical
consistency levels.
4. Ensure that the data is flowing to the new nodes so you won't have any gaps when you copy the
snapshots to the new cluster in step 6.
5. Snapshot the old EC2 cluster.
6. Copy the data files from your keyspaces to the nodes.
• If not using vnodes and the node ratio is 1:1, it is simpler and more efficient to copy the data files to their matching nodes.
• If the clusters are different sizes or if you are using vnodes, use the Cassandra bulk loader (sstableloader).
7. You can either switch to the new cluster all at once or perform an incremental migration.
For example, to perform an incremental migration, you can set your client to designate a percentage of
the reads that go to the new cluster. This allows you to test the new cluster before decommissioning the
old cluster.
8. Decommission the old cluster:
a) Remove the cluster from the OpsCenter.
b) Remove the nodes.
When increasing capacity by a non-uniform number of nodes, you must recalculate tokens for the entire cluster, and then use nodetool move to assign the new tokens to the existing nodes. After all nodes are restarted with their new token assignments, run nodetool cleanup to remove unused keys on all nodes.
These operations are resource intensive and should be done during low-usage times.
Add one node at a time and leave the initial_token property empty
When the initial_token is empty, Cassandra splits the token range of the heaviest loaded node and places
the new node into the ring at that position. This approach is unlikely to result in a perfectly balanced ring,
but will alleviate hot spots.
hostConfig.setAutoDiscoverHosts(false);
5. If using a QUORUM consistency level for reads or writes, check the LOCAL_QUORUM or
EACH_QUORUM consistency level to make sure that the level meets the requirements for multiple data
centers.
6. Start the new nodes.
7. After all nodes are running in the cluster:
a. Change the replication factor for your keyspace for the expanded cluster.
b. Run nodetool rebuild on each node in the new data center.
initial_token: 28356863910078205288614550619314017620
6. Configure any non-default settings in the node's cassandra.yaml to match your existing cluster.
7. Start the new node.
8. After the new node has finished bootstrapping, check that it is marked up using the nodetool ring
command.
9. Run nodetool repair on each keyspace to ensure the node is fully consistent. For example:
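A sketch of the command (the keyspace name is a placeholder):
$ nodetool repair my_keyspace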
Backing up and restoring data
About snapshots
A brief description of how Cassandra backs up data.
Cassandra backs up data by taking a snapshot of all on-disk data files (SSTable files) stored in the data
directory. You can take a snapshot of all keyspaces, a single keyspace, or a single table while the system
is online.
Using a parallel ssh tool (such as pssh), you can snapshot an entire cluster. This provides an eventually
consistent backup. Although no one node is guaranteed to be consistent with its replica nodes at the time
a snapshot is taken, a restored snapshot resumes consistency using Cassandra's built-in consistency
mechanisms.
After a system-wide snapshot is performed, you can enable incremental backups on each node to back up
data that has changed since the last snapshot: each time an SSTable is flushed, a hard link is copied into a
/backups subdirectory of the data directory (provided JNA is enabled).
Note: If JNA is enabled, snapshots are performed by hard links. If not enabled, I/O activity
increases as the files are copied from one location to another, which significantly reduces efficiency.
Taking a snapshot
Steps for taking a global or per-node snapshot.
Procedure
Run the nodetool snapshot command, specifying the hostname, JMX port, and keyspace. For example:
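A sketch, assuming the local node and a keyspace named mykeyspace:
$ nodetool -h localhost -p 7199 snapshot mykeyspace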
Results
The snapshot is created in data_directory_location/keyspace_name/table_name-UUID/
snapshots/snapshot_name directory. Each snapshot directory contains numerous .db files that
contain the data at the time of the snapshot.
For example:
Package installations:
/var/lib/cassandra/data/mykeyspace/users-081a1500136111e482d09318a3b15cc2/
snapshots/1406227071618/mykeyspace-users-ka-1-Data.db
Tarball installations:
install_location/data/data/mykeyspace/
users-081a1500136111e482d09318a3b15cc2/snapshots/1406227071618/mykeyspace-
users-ka-1-Data.db
Procedure
To delete all snapshots for a node, run the nodetool clearsnapshot command. For example:
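Run with no arguments, the command removes all snapshots on the node:
$ nodetool clearsnapshot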
To delete snapshots on all nodes at once, run the nodetool clearsnapshot command using a parallel
ssh utility.
Procedure
Edit the cassandra.yaml configuration file on each node in the cluster and change the value of
incremental_backups to true.
Procedure
You can restore a snapshot in several ways:
• Use the sstableloader tool.
• Copy the snapshot SSTable directory (see Taking a snapshot) to the
data/keyspace/table_name-UUID directory and then call the JMX method
loadNewSSTables() in the column family MBean for each column family through JConsole. You can
use nodetool refresh instead of the loadNewSSTables() call.
The location of the data directory depends on the type of installation:
• Package installations: /var/lib/cassandra/data
• Tarball installations: install_location/data/data
The tokens for the cluster you are restoring must match exactly the tokens of the backed-up cluster
at the time of the snapshot. Furthermore, the snapshot must be copied to the correct node with matching tokens. If the tokens do not match, or the number of nodes does not match, use the
sstableloader procedure.
• Use the Node Restart Method described below.
Procedure
1. Shut down the node.
2. Clear all files in the commitlog directory.
This prevents the commitlog replay from putting data back, which would defeat the purpose of restoring
data to a particular point in time.
3. Delete all *.db files in the data_directory_location/keyspace_name/table_name-UUID directory, but DO NOT delete the /snapshots and /backups subdirectories.
where data_directory_location is:
• Package installations: /var/lib/cassandra/data
• Tarball installations: install_location/data/data
4. Locate the most recent snapshot folder in this directory:
data_directory_location/keyspace_name/table_name-UUID/
snapshots/snapshot_name
5. Copy its contents into this directory:
data_directory_location/keyspace_name/table_name-UUID
6. If using incremental backups, copy all contents of this directory:
data_directory_location/keyspace_name/table_name-UUID/backups
7. Paste it into this directory:
data_directory_location/keyspace_name/table_name-UUID
8. Restart the node.
Restarting causes a temporary burst of I/O activity and consumes a large amount of CPU resources.
9. Run nodetool repair.
Note: This procedure assumes you are familiar with restoring a snapshot and configuring and
initializing a cluster. If not, see Initializing a cluster.
The location of the cassandra.yaml file depends on the type of installation:
Procedure
To recover the snapshot on the new cluster:
1. From the old cluster, retrieve the list of tokens associated with each node's IP:
This allows Cassandra to read the SSTable snapshot from the old cluster.
$ nodetool ring | grep ip_address_of_node | awk ' {print $NF ","}' | xargs
4. On the node with the new disk, add the list of tokens from the previous step (separated by commas),
under initial_token in the cassandra.yaml file.
5. Clear each system directory for every functioning drive:
Assuming disk1 has failed and the data_file_directories setting in the cassandra.yaml for each drive
is:
-/mnt1/cassandra/data
-/mnt2/cassandra/data
-/mnt3/cassandra/data
$ rm -fr /mnt2/cassandra/data/system
$ rm -fr /mnt3/cassandra/data/system
6. Start the node and Cassandra.
7. Run nodetool repair.
8. After the node is fully integrated into the cluster, it is recommended to return to normal vnode settings:
• num_tokens: number_of_tokens
• #initial_token
If the node uses assigned tokens (single-token architecture):
1. Stop Cassandra and shut down the node.
2. Replace the failed disk.
3. Clear each system directory for every functioning drive:
Assuming disk1 has failed and the data_file_directories setting in the cassandra.yaml for each drive
is:
-/mnt1/cassandra/data
-/mnt2/cassandra/data
-/mnt3/cassandra/data
$ rm -fr /mnt2/cassandra/data/system
$ rm -fr /mnt3/cassandra/data/system
4. Start the node and Cassandra.
5. Run nodetool repair on the node.
The location of the cassandra.yaml file depends on the type of installation:
Cassandra tools
Topics for Cassandra tools.
Command format
• Package installations: nodetool [(-h <host> | --host <host>)] [(-p <port> | --port
<port>)] [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
[(-u <username> | --username <username>)] [(-pw <password> | --password
<password>)] <command> [<args>]
• Tarball installations: Execute the command from install_location/bin
• If a username and password for RMI authentication are set explicitly in the cassandra-env.sh file for
the host, then you must specify credentials.
The repair and rebuild commands can affect multiple nodes in the cluster.
Most nodetool commands operate on a single node in the cluster if -h is not used to identify one or
more other nodes. If the node from which you issue the command is the intended target, you do not
need the -h option to identify the target; otherwise, for remote invocation, identify the target node, or
nodes, using -h.
nodetool cfhistograms
Provides statistics about a table that could be used to plot a frequency function.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The nodetool cfhistograms command provides statistics about a table, including read/write latency,
partition size, column count, and number of SSTables. The report covers all operations since the last time
you ran nodetool cfhistograms in this session. The use of the metrics-core library in Cassandra 2.1 makes
the output more informative and easier to understand.
Example
For example, to get statistics about the libout table in the libdata keyspace on Linux, use this command:
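A sketch of the command, with the keyspace name followed by the table name:
$ nodetool cfhistograms libdata libout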
Output is:
libdata/libout histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
The output shows the percentile rank of read and write latency values, the partition size, and the cell count
for the table.
nodetool cfstats
Provides statistics about tables.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The nodetool cfstats command provides statistics about one or more tables. You use dot notation
to specify one or more keyspace and table names. If you do not specify a keyspace and table, Cassandra
provides statistics about all tables. If you use the -i option, Cassandra provides statistics about all tables
except the given ones. The use of the metrics-core library in Cassandra 2.1 makes the output more
informative and easier to understand.
This table describes the nodetool cfstats output.
Examples
This example shows an excerpt of the output of the command after flushing a table of library data to disk.
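A command of this form produces such output, using the dot notation described above (the keyspace and table names are placeholders):
$ nodetool cfstats libdata.libout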
The location of the cassandra.yaml file depends on the type of installation:
nodetool cleanup
Cleans up keyspaces and partition keys no longer belonging to a node.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Use this command to remove unwanted data after adding a new node to the cluster. Cassandra does
not automatically remove data from nodes that lose part of their partition range to a newly added node.
Run nodetool cleanup on the source node and on neighboring nodes that shared the same subrange
after the new node is up and running. Failure to run this command after adding a node causes Cassandra
to include the old data to rebalance the load on that node. Running the nodetool cleanup command
causes a temporary increase in disk space usage proportional to the size of your largest SSTable. Disk I/O
occurs when running this command.
Running this command affects nodes that use a counter column in a table. Cassandra assigns a new
counter ID to the node.
Optionally, this command takes a list of table names. If you do not specify a keyspace, this command
cleans all keyspaces no longer belonging to a node.
nodetool clearsnapshot
Removes one or more snapshots.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Deletes snapshots in one or more keyspaces. To remove all snapshots, omit the snapshot name.
Caution: Removing all snapshots or all specific keyspace snapshots also removes OpsCenter
backups.
nodetool compact
Forces a major compaction on one or more tables.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
This command starts the compaction process on tables that use the SizeTieredCompactionStrategy
and DateTieredCompactionStrategy. You can specify a keyspace for compaction. If you do not specify
a keyspace, the nodetool command uses the current keyspace. You can specify one or more tables for
compaction. If you do not specify any tables, compaction of all tables in the keyspace occurs. This is called
a major compaction. If you do specify one or more tables, compaction of the specified tables occurs. This
is called a minor compaction. A major compaction combines each of the pools of repaired and unrepaired
SSTables into one repaired and one unrepaired SSTable. During compaction, there is a temporary spike in
disk space usage and disk I/O because the old and new SSTables co-exist. A major compaction can cause
considerable disk I/O.
nodetool compactionhistory
Provides the history of compaction operations.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Example
The actual output of compaction history is seven columns wide. The first three columns show the id,
keyspace name, and table name of the compacted SSTable.
$ nodetool compactionhistory
Compaction History:
id                                    keyspace_name  columnfamily_name      ...
d06f7080-07a5-11e4-9b36-abc3a0ec9088  system         schema_columnfamilies  ...
The four columns to the right of the table name show the timestamp, size of the SSTable before and after
compaction, and the number of partitions merged. The notation means {tables:rows}. For example: {1:3,
3:1} means 3 rows were taken from one SSTable (1:3) and 1 row taken from 3 SSTables (3:1) to make the
one SSTable in that compaction operation.
nodetool compactionstats
Provides statistics about a compaction.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The total column shows the total number of uncompressed bytes of SSTables being compacted. The
system log lists the names of the SSTables compacted.
Example
$ bin/nodetool compactionstats
pending tasks: 5
   compaction type   keyspace    table       completed   total       unit    progress
   Compaction        Keyspace1   Standard1   282310680   302170540   bytes   93.43%
   Compaction        Keyspace1   Standard1   58457931    307520780   bytes   19.01%
Active compaction remaining time : 0h00m16s
nodetool decommission
Deactivates a node by streaming its data to another node.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Causes a live node to decommission itself, streaming its data to the next node on the ring. Use netstats to
monitor the progress, as described on http://wiki.apache.org/cassandra/NodeProbe#Decommission and
http://wiki.apache.org/cassandra/Operations#Removing_nodes_entirely.
nodetool describecluster
Provides the name, snitch, partitioner, and schema version of a cluster.
Synopsis
• options are:
• ( -h | --host ) <host name> | <ip address>
• ( -p | --port ) <port number>
• ( -pw | --password ) <password >
• ( -u | --username ) <user name>
• ( -pwf <passwordFilePath> | --password-file <passwordFilePath> )
• -- separates an option and argument that could be mistaken for an option.
• data center is the name of an arbitrarily chosen data center from which to select sources for streaming.
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Describe cluster is typically used to validate the schema after upgrading. If a schema disagreement occurs,
check for and resolve schema disagreements.
Example
$ bin/nodetool describecluster
Cluster Information:
Name: Test Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
65e78f0e-e81e-30d8-a631-a65dff93bf82: [127.0.0.1]
If a schema disagreement occurs, the last line of the output includes information about unreachable nodes.
$ bin/nodetool describecluster
Cluster Information:
Name: Production Cluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
UNREACHABLE: 1176b7ac-8993-395d-85fd-41b89ef49fbb:
[10.202.205.203]
nodetool describering
Provides the partition ranges of a keyspace.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Example
This example shows the sample output of the command on a three-node cluster.
Schema Version:1b04bd14-0324-3fc8-8bcb-9256d1e15f82
TokenRange:
TokenRange(start_token:3074457345618258602,
end_token:-9223372036854775808,
endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3],
rpc_endpoints:[127.0.0.1, 127.0.0.2, 127.0.0.3],
endpoint_details:[EndpointDetails(host:127.0.0.1,
datacenter:datacenter1, rack:rack1),
EndpointDetails(host:127.0.0.2, datacenter:datacenter1,
rack:rack1),
EndpointDetails(host:127.0.0.3, datacenter:datacenter1,
rack:rack1)])
TokenRange(start_token:-3074457345618258603,
end_token:3074457345618258602,
endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2],
rpc_endpoints:[127.0.0.3, 127.0.0.1, 127.0.0.2],
endpoint_details:[EndpointDetails(host:127.0.0.3,
datacenter:datacenter1, rack:rack1),
EndpointDetails(host:127.0.0.1, datacenter:datacenter1,
rack:rack1),
EndpointDetails(host:127.0.0.2, datacenter:datacenter1,
rack:rack1)])
TokenRange(start_token:-9223372036854775808,
end_token:-3074457345618258603,
endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1],
rpc_endpoints:[127.0.0.2, 127.0.0.3, 127.0.0.1],
endpoint_details:[EndpointDetails(host:127.0.0.2,
datacenter:datacenter1, rack:rack1),
EndpointDetails(host:127.0.0.3, datacenter:datacenter1,
rack:rack1),
EndpointDetails(host:127.0.0.1, datacenter:datacenter1,
rack:rack1)])
nodetool disableautocompaction
Disables autocompaction for a keyspace and one or more tables.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The keyspace can be followed by one or more tables.
nodetool disablebackup
Disables incremental backup.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool disablebinary
Disables the native transport.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Disables the binary protocol, also known as the native transport.
nodetool disablegossip
Disables the gossip protocol.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
This command effectively marks the node as being down.
nodetool disablehandoff
Disables storing of future hints on the current node.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool disablethrift
Disables the Thrift server.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
Description
nodetool disablethrift disables Thrift on a node, preventing the node from acting as a coordinator. The node can still be a replica for a different coordinator, and data read at consistency level ONE could
be stale. To cause a node to ignore read requests from other coordinators, nodetool disablegossip
would also need to be run. However, if both commands are run, the node will not perform repairs, and
the node will continue to store stale data. If the goal is to repair the node, set the read operations to a
consistency level of QUORUM or higher while you run repair. An alternative approach is to delete the
node's data and restart the Cassandra process.
Examples
nodetool drain
Drains the node.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Flushes all memtables from the node to SSTables on disk. Cassandra stops listening for connections from
the client and other nodes. You need to restart Cassandra after running nodetool drain. You typically
use this command before upgrading a node to a new version of Cassandra. To simply flush memtables to
disk, use nodetool flush.
nodetool enableautocompaction
Enables autocompaction for a keyspace and one or more tables.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The keyspace can be followed by one or more tables. Enables compaction for the named keyspace or the
current keyspace, and one or more named tables, or all tables.
nodetool enablebackup
Enables incremental backup.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool enablebinary
Re-enables native transport.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Re-enables the binary protocol, also known as native transport.
nodetool enablegossip
Re-enables gossip.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool enablehandoff
Re-enables the storing of future hints on the current node.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool enablethrift
Re-enables the Thrift server.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool flush
Flushes one or more tables from the memtable.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
You can specify a keyspace followed by one or more tables that you want to flush from the memtable to
SSTables on disk.
nodetool getcompactionthreshold
Provides the minimum and maximum compaction thresholds (in number of SSTables) for a table.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool getendpoints
Provides the IP addresses or names of replicas that own the partition key.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Example
For example, which nodes own partition key_1, key_2, and key_3?
Note: The partitioner returns a token for the key. Cassandra will return an endpoint whether or not
data exists on the identified node for that token.
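The queries take the following form, one invocation per key (the keyspace, table, and key values are placeholders); each invocation prints the endpoints that own the key:
$ nodetool getendpoints mykeyspace mytable key_1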
127.0.0.2
127.0.0.2
127.0.0.1
nodetool getlogginglevels
Get the runtime logging levels.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool getsstables
Provides the SSTables that own the partition key.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool getstreamthroughput
Provides the megabytes per second throughput limit for streaming in the system.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool gossipinfo
Provides the gossip information for the cluster.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool help
Provides nodetool command help.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
Description
The help command provides a synopsis and brief description of each nodetool command.
Examples
Using nodetool help lists all commands and usage information. For example, nodetool help netstats
provides the following information.
NAME
nodetool netstats - Print network information on provided host
(connecting node by default)
SYNOPSIS
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
[(-pw <password> | --password <password>)]
[(-u <username> | --username <username>)] netstats
OPTIONS
-h <host>, --host <host>
Node hostname or ip address
nodetool info
Provides node information, such as load and uptime.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
Description
Provides node information including the token and on disk storage (load) information, times started
(generation), uptime in seconds, and heap memory usage.
nodetool invalidatekeycache
Resets the global key cache parameter to the default, which saves all keys.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
By default the key_cache_keys_to_save is disabled in the cassandra.yaml. This command resets the
parameter to the default.
The location of the cassandra.yaml file depends on the type of installation:
nodetool invalidaterowcache
Resets the global row cache parameter, row_cache_keys_to_save, to the default (not set), which saves all keys.
Synopsis
options are:
• ( -h | --host ) <host name> | <ip address>
• ( -p | --port ) <port number>
• ( -pw | --password ) <password >
• ( -u | --username ) <user name>
• ( -pwf <passwordFilePath> | --password-file <passwordFilePath> )
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool join
Causes the node to join the ring.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Causes the node to join the ring, assuming the node was initially not started in the ring using the -Dcassandra.join_ring=false startup option. The joining node should be properly configured with the desired
options for seed list, initial token, and auto-bootstrapping.
nodetool listsnapshots
Lists snapshot names, size on disk, and true size.
Synopsis
options are:
• ( -h | --host ) <host name> | <ip address>
• ( -p | --port ) <port number>
• ( -pw | --password ) <password >
• ( -u | --username ) <user name>
• ( -pwf <passwordFilePath> | --password-file <passwordFilePath> )
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Available in Cassandra 2.1 and later.
Example
Snapshot Details:
Snapshot Name Keyspace Column Family True Size Size on Disk
nodetool move
Moves the node on the token ring to a new token.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
Description
Escape negative tokens using \\ . For example: move \\-123 . This command essentially combines
decommission and bootstrap operations.
nodetool netstats
Provides network information about the host.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The default host is the connected host if the user does not include a host name or IP address in the
command. The output includes the following information:
• JVM settings
• Mode
• Read repair statistics
• Attempted
The number of successfully completed read repair operations
• Mismatch (blocking)
The number of read repair operations since server restart that blocked a query.
• Mismatch (background)
The number of read repair operations since server restart performed in the background.
• Pool name
Information about client read and write requests by thread pool.
• Active, pending, and completed number of commands and responses
Example
Get the network information for a node 10.171.147.128:
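A sketch of the command for that node:
$ nodetool netstats -h 10.171.147.128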
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed
Commands n/a 0 1156
Responses n/a 0 2750
nodetool pausehandoff
Pauses the hints delivery process
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool proxyhistograms
Provides a histogram of network statistics.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The output of this command shows the full request latency recorded by the coordinator. The output
includes the percentile rank of read and write latency values for inter-node communication. Typically, you
use the command to see if requests encounter a slow node.
Examples
This example shows the output from nodetool proxyhistograms after running 4,500 insert statements and
45,000 select statements on a three-node ccm cluster on a local computer.
$ nodetool proxyhistograms
proxy histograms
Percentile Read Latency Write Latency Range Latency
(micros) (micros) (micros)
50% 1502.50 375.00 446.00
75% 1714.75 420.00 498.00
95% 31210.25 507.00 800.20
98% 36365.00 577.36 948.40
99% 36365.00 740.60 1024.39
Min 616.00 230.00 311.00
Max 36365.00 55726.00 59247.00
nodetool rangekeysample
Provides the sampled keys held across all keyspaces.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool rebuild
Rebuilds data by streaming from other nodes.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
This command operates on multiple nodes in a cluster. Similar to bootstrap. Rebuild (like bootstrap) only
streams data from a single source replica per range. Use this command to bring up a new data center in an
existing cluster.
nodetool rebuild_index
Performs a full rebuild of the index for a table
Synopsis
The keyspace and table name followed by a list of index names. For example: Standard3.IdxName
Standard3.IdxName1
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Fully rebuilds one or more indexes for a table.
nodetool refresh
Loads newly placed SSTables onto the system without a restart.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool reloadtriggers
Reloads trigger classes.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Available in Cassandra 2.1 and later.
nodetool removenode
Provides the status of current node removal, forces completion of pending removal, or removes the
identified node.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
This command removes a node, shows the status of a removal operation, or forces the completion
of a pending removal. When the node is down and nodetool decommission cannot be used, use
nodetool removenode. Run this command only on nodes that are down. If the cluster does not use
vnodes, before running the nodetool removenode command, adjust the tokens.
Examples
Determine the UUID of the node to remove by running nodetool status. Use the UUID of the node that
is down to remove the node.
$ nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.2.101  112.82 KB  256     31.7%             420129fc-0d84-42b0-be41-ef7dd3a8ad06  RAC1
DN  192.168.2.103  91.11 KB   256     33.9%             d0844a21-3698-4883-ab66-9e2fd5150edd  RAC1
UN  192.168.2.102  124.42 KB  256     32.6%             8d5ed9f4-7764-4dbd-bad8-43fddce94b7c  RAC1
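Remove the node by passing the Host ID of the down node to the command, for example:
$ nodetool removenode d0844a21-3698-4883-ab66-9e2fd5150edd
Running nodetool status again shows that the node no longer appears in the ring: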
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID                               Rack
UN  192.168.2.101  112.82 KB  256     37.7%             420129fc-0d84-42b0-be41-ef7dd3a8ad06  RAC1
UN  192.168.2.102  124.42 KB  256     38.3%             8d5ed9f4-7764-4dbd-bad8-43fddce94b7c  RAC1
nodetool repair
Repairs one or more tables.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Performing an anti-entropy node repair on a regular basis is important, especially when frequently deleting
data. The nodetool repair command repairs one or more nodes in a cluster, and includes an option to
restrict repair to a set of nodes. Anti-entropy node repair performs the following tasks:
• Ensures that all data on a replica is consistent.
• Repairs inconsistencies on a node that has been down.
By default, Cassandra 2.1 does a full, sequential repair.
Using options
You can use options to do these other types of repair:
• Incremental
• Parallel
Use the -hosts option to list the good nodes to use for repairing the bad nodes. Use -h to name the bad
nodes.
Use the -inc option for an incremental repair. An incremental repair eliminates the need for constant Merkle
tree construction by persisting already repaired data and calculating only the Merkle trees for SSTables
that have not been repaired. The repair process is likely more performant than the other types of repair
even as datasets grow, assuming you run repairs frequently. Before doing an incremental repair for the first
time, perform the migration steps, if necessary.
Use the -par option for a parallel repair. Unlike sequential repair, parallel repair constructs the Merkle
trees for all nodes at the same time. Therefore, no snapshots are required (or generated). Use a parallel
repair to complete the repair quickly or when you have operational downtime that allows the resources to
be completely consumed during the repair.
Performing partitioner range repairs by using the -pr option is generally not recommended.
Example
All nodetool repair arguments are optional. The following examples show the following types of repair:
• An incremental, parallel repair of all keyspaces on the current node
• A partitioner range repair of the bad partition on current node using the good partitions on 10.2.2.20 or
10.2.2.21
• A start-point-to-end-point repair of all nodes between two nodes on the ring
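Sketches of commands matching these cases are shown below; the addresses and token placeholders are illustrative, so check nodetool help repair for the exact option forms:
$ nodetool repair -par -inc
$ nodetool repair -pr -hosts 10.2.2.20,10.2.2.21
$ nodetool repair -st <start_token> -et <end_token>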
An inspection of the system.log shows repair taking place only on IP addresses in DC1.
. . .
INFO [AntiEntropyStage:1] 2014-07-24 22:23:10,708 RepairSession.java:171
- [repair #16499ef0-1381-11e4-88e3-c972e09793ca] Received merkle tree
for sessions from /192.168.2.101
INFO [RepairJobTask:1] 2014-07-24 22:23:10,740 RepairJob.java:145
- [repair #16499ef0-1381-11e4-88e3-c972e09793ca] requesting merkle trees
for events (to [/192.168.2.103, /192.168.2.101])
. . .
nodetool resetlocalschema
Resets the node's local schema and resynchronizes it.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Normally, this command is used to rectify schema disagreements on different nodes. It can be useful if
table schema changes have generated too many tombstones, on the order of 100,000s.
nodetool resetlocalschema drops the schema information of the local node and resynchronizes the
schema from another node. To drop the schema, the tool truncates all the system schema tables. The
node will temporarily lose metadata about the tables on the node, but will rewrite the information from
another node. If the node is experiencing problems with too many tombstones, the truncation of the tables
will eliminate the tombstones.
This command is useful when you have one node that is out of sync with the cluster. The system schema
tables must have another node from which to fetch the tables. It is not useful when all or many of your
nodes are in an incorrect state. If there is only one node in the cluster (replication factor of 1), the command does
not perform the operation, because another node from which to fetch the tables does not exist. Run the
command on the node experiencing difficulty.
nodetool resumehandoff
Resumes the hints delivery process.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
nodetool ring
Provides node status and information about the ring.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Displays node status and information about the ring as determined by the node being queried. This
information can give you an idea of the load balance and if any nodes are down. If your cluster is not
properly configured, different nodes may show a different ring. Check that each node appears the same way
in the ring. If you use virtual nodes (vnodes), use nodetool status for succinct output.
• Address
The node's URL.
• DC (data center)
The data center containing the node.
• Rack
The rack or, in the case of Amazon EC2, the availability zone of the node.
• Status - Up or Down
Indicates whether the node is functioning or not.
• State - N (normal), L (leaving), J (joining), M (moving)
The state of the node in relation to the cluster.
• Load - updates every 90 seconds
The amount of file system data under the cassandra data directory, excluding all content in the
snapshots subdirectories. Because all SSTable data files are included, any data that is not cleaned up
(such as TTL-expired cells or tombstoned data) is counted.
• Token
The end of the token range up to and including the value listed. For an explanation of token ranges, see
Data Distribution in the Ring.
• Owns
The percentage of the data owned by the node per data center times the replication factor. For
example, a node can own 33% of the ring, but show 100% if the replication factor is 3.
• Host ID
The network ID of the node.
nodetool scrub
Rebuild SSTables for one or more Cassandra tables.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Rebuilds SSTables on a node for the named tables and snapshots data files before rebuilding as a safety
measure. If possible use upgradesstables. While scrub rebuilds SSTables, it also discards data that it
deems broken and creates a snapshot, which you have to remove manually. If the -ns option is specified,
snapshot creation is disabled. If scrub can't validate the column value against the column definition's data
type, it logs the partition key and skips to the next partition. Skipping corrupted partitions in tables having
counter columns results in under-counting. By default the scrub operation stops if you attempt to skip such
a partition. To force the scrub to skip the partition and continue scrubbing, re-run nodetool scrub using
the --skip-corrupted option.
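For example, to scrub a single table while skipping corrupted partitions (the keyspace and table names here are placeholders):
$ nodetool scrub --skip-corrupted Keyspace1 Standard1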
nodetool setcachecapacity
Set global key and row cache capacities in megabytes.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The key-cache-capacity argument corresponds to the key_cache_size_in_mb parameter in the
cassandra.yaml. Each key cache hit saves one seek and each row cache hit saves a minimum of two
seeks. Devoting some memory to the key cache is usually a good tradeoff considering the positive effect
on the response time. The default value is empty, which sets the capacity to the smaller of 5% of the heap
(in MB) and 100 MB.
The row-cache-capacity argument corresponds to the row_cache_size_in_mb parameter in the
cassandra.yaml. By default, row caching is zero (disabled).
The counter-cache-capacity argument corresponds to the counter_cache_size_in_mb parameter in the
cassandra.yaml. By default, the counter cache capacity is the smaller of 2.5% of the heap and 50 MB.
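The capacities are passed as positional arguments in the order key cache, row cache, counter cache; the values in this sketch are illustrative:
$ nodetool setcachecapacity 100 0 50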
The location of the cassandra.yaml file depends on the type of installation:
nodetool setcachekeystosave
Sets the number of keys saved by each cache for faster post-restart warmup.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
This command saves the specified number of keys from the key and row caches to the saved caches directory,
which you specify in the cassandra.yaml. The key-cache-keys-to-save argument corresponds to the
key_cache_keys_to_save in the cassandra.yaml, which is disabled by default, meaning all keys will
be saved. The row-cache-keys-to-save argument corresponds to the row_cache_keys_to_save in the
cassandra.yaml, which is disabled by default.
The location of the cassandra.yaml file depends on the type of installation:
nodetool setcompactionthreshold
Sets minimum and maximum compaction thresholds for a table.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
This parameter controls how many SSTables of a similar size must be present before a minor compaction
is scheduled. The max_threshold table property sets an upper bound on the number of SSTables that
may be compacted in a single minor compaction, as described in http://wiki.apache.org/cassandra/MemtableSSTable.
When using LeveledCompactionStrategy, maxthreshold sets the MAX_COMPACTING_L0, which limits
the number of L0 SSTables that are compacted concurrently to avoid wasting memory or running out of
memory when compacting highly overlapping SSTables.
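For example, to set the minimum threshold to 4 and the maximum to 32 for one table (the keyspace and table names are placeholders):
$ nodetool setcompactionthreshold Keyspace1 Standard1 4 32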
nodetool setcompactionthroughput
Sets the throughput capacity for compaction in the system, or disables throttling.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Set value_in_mb to 0 to disable throttling.
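For example, 16 MB per second is a common setting, and 0 disables throttling:
$ nodetool setcompactionthroughput 16
$ nodetool setcompactionthroughput 0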
nodetool sethintedhandoffthrottlekb
Sets hinted handoff throttle in kb/sec per delivery thread. (Cassandra 2.1.1 and later)
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
When a node detects that a node for which it is holding hints has recovered, it begins sending the hints to
that node. This setting specifies the maximum throttle per delivery thread, in kilobytes per second. The
throttle is reduced proportionally to the number of nodes in the cluster. For example, if there are two nodes
in the cluster, each delivery thread uses the maximum rate; if there are three nodes, each node throttles to
half of the maximum, because two nodes are expected to deliver hints simultaneously.
Example
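A representative invocation, setting the throttle to 2048 KB per second per delivery thread:
$ nodetool sethintedhandoffthrottlekb 2048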
nodetool setlogginglevel
Set the log level for a service.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
You can use this command to set logging levels for services instead of modifying the logback.xml file.
The following values are valid for the logger class qualifier:
• org.apache.cassandra
• org.apache.cassandra.db
• org.apache.cassandra.service.StorageProxy
The possible log levels are:
• ALL
• TRACE
• DEBUG
• INFO
• WARN
• ERROR
• OFF
If both class qualifier and level arguments to the command are empty or null, the command resets logging
to the initial configuration.
Example
This command sets the StorageProxy service to debug level.
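The invocation takes the class qualifier followed by the level:
$ nodetool setlogginglevel org.apache.cassandra.service.StorageProxy DEBUG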
nodetool setstreamthroughput
Sets the throughput capacity in MB for streaming in the system, or disables throttling.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Set value_in_MB to 0 to disable throttling.
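For example, 200 MB per second is a common setting, and 0 disables throttling:
$ nodetool setstreamthroughput 200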
nodetool settraceprobability
Sets the probability for tracing a request.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Probabilistic tracing is useful to determine the cause of intermittent query performance problems by
identifying which queries are responsible. This option traces some or all statements sent to a cluster.
Tracing a request usually requires at least 10 rows to be inserted.
A probability of 1.0 will trace everything whereas lesser amounts (for example, 0.10) only sample a certain
percentage of statements. Care should be taken on large and active systems, as system-wide tracing
will have a performance impact. Unless you are under very light load, tracing all requests (probability
1.0) will probably overwhelm your system. Start with a small fraction, for example, 0.001 and increase
only if necessary. The trace information is stored in the system_traces keyspace, which holds two tables,
sessions and events. These tables can be queried to answer questions such as what the most time-consuming
query has been since a trace was started. Query the parameters map and thread column in the
system_traces.sessions and events tables for probabilistic tracing information.
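For example, to sample roughly 0.1% of requests and then inspect the collected sessions (the LIMIT value is arbitrary):
$ nodetool settraceprobability 0.001
SELECT * FROM system_traces.sessions LIMIT 10;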
nodetool snapshot
Takes a snapshot of one or more keyspaces, or of a table, to back up data.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Use this command to backup and restore using a snapshot. See the examples below for various options.
Cassandra flushes the node before taking a snapshot, takes the snapshot, and stores the data in the
snapshots directory of each keyspace in the data directory. If you do not specify the name of a snapshot
directory using the -t option, Cassandra names the directory using the timestamp of the snapshot, for
example 1391460334889. Follow the procedure for taking a snapshot before upgrading Cassandra. When
upgrading, back up all keyspaces. For more information about snapshots, see the Apache documentation.
$ bin/nodetool snapshot
The following message appears:
Because you did not specify a snapshot name, Cassandra names snapshot directories using the
timestamp of the snapshot. If the keyspace contains no data, empty directories are not created.
Assuming the music keyspace contains two tables, songs and playlists, taking a snapshot of the keyspace
creates multiple snapshot directories named 2014.06.24. A number of .db files containing the data are
located in these directories. For example, from the installation directory:
$ cd data/data/music/playlists-bf8118508cfd11e3972273ded3cb6170/
snapshots/1404936753154
$ ls
music-playlists-ka-1-CompressionInfo.db music-playlists-ka-1-Index.db
music-playlists-ka-1-TOC.txt
music-playlists-ka-1-Data.db music-playlists-ka-1-Statistics.db
music-playlists-ka-1-Filter.db music-playlists-ka-1-Summary.db
$ cd data/data/music/songs-b8e385a08cfd11e3972273ded3cb6170/snapshots/1404936753154
Cassandra creates the snapshot directory named 1391461910600 that contains the backup data of
playlists table in data/data/music/playlists-bf8118508cfd11e3972273ded3cb6170/
snapshots, for example.
nodetool status
Provide information about the cluster, such as the state, load, and IDs.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
The status command provides the following information:
• Status - U (up) or D (down)
Indicates whether the node is functioning or not.
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID                               Rack
UN  127.0.0.1  47.66 KB  1       33.3%  aaa1b7c1-6049-4a08-ad3e-3697a0e30e10  rack1
UN  127.0.0.2  47.67 KB  1       33.3%  1848c369-4306-4874-afdf-5c1e95b8732e  rack1
UN  127.0.0.3  47.67 KB  1       33.3%  49578bf1-728f-438d-b1c1-d8dd644b6f7f  rack1
nodetool statusbackup
Provide the status of backup
Synopsis
Synopsis Legend
In the synopsis section of each statement, formatting has the following meaning:
• Uppercase means literal
• Lowercase means not literal
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
A semicolon that terminates CQL statements is not included in the synopsis.
Description
Provides the status of backup.
nodetool statusbinary
Provide the status of native transport.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Provides the status of the binary protocol, also known as the native transport.
nodetool statusgossip
Provide the status of gossip.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Provides the status of gossip.
nodetool statushandoff
Provides the status of hinted handoff.
Synopsis
Synopsis Legend
In the synopsis section of each statement, formatting has the following meaning:
• Uppercase means literal
• Lowercase means not literal
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
A semicolon that terminates CQL statements is not included in the synopsis.
Description
Provides the status of hinted handoff.
nodetool statusthrift
Provide the status of the Thrift server.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool stop
Stops the compaction process.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Stops an operation from continuing to run. This command is typically used to stop a compaction that has a
negative impact on the performance of a node. After the compaction stops, Cassandra continues with the
remaining operations in the queue. Eventually, Cassandra restarts the compaction.
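For example, to stop a running compaction; other operation types, such as VALIDATION, CLEANUP, SCRUB, and INDEX_BUILD, can be stopped the same way:
$ nodetool stop COMPACTION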
nodetool stopdaemon
Stops the cassandra daemon.
Synopsis
options are:
• ( -h | --host ) <host name> | <ip address>
• ( -p | --port ) <port number>
• ( -pw | --password ) <password >
• ( -u | --username ) <user name>
• ( -pwf <passwordFilePath> | --password-file <passwordFilePath> )
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool tpstats
Provides usage statistics of thread pools.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Cassandra is based on a Staged Event Driven Architecture (SEDA). Different tasks are separated into
stages that are connected by a messaging service. Stages have a queue and thread pool. Some stages
skip the messaging service and queue tasks immediately on a different stage if it exists on the same node.
The queues can back up if the next stage is too busy, and a backed-up queue can cause performance
bottlenecks. nodetool tpstats provides statistics about the number of active, pending, and completed
tasks for each stage of Cassandra operations by thread pool. A high number of pending tasks for any pool
can indicate performance problems, as described in http://wiki.apache.org/cassandra/Operations#Monitoring.
Run the nodetool tpstats command on a local node to get thread pool statistics.
This table describes key indicators:
Example
Run the command every two seconds.
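One way to do this is with the standard watch utility:
$ watch -n 2 nodetool tpstats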
InternalResponseStage            0         0              1         0                 0
HintedHandoff                    0         0              0
nodetool truncatehints
Truncates all hints on the local node, or truncates hints for one or more endpoints.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
nodetool upgradesstables
Rewrites older SSTables to the current version of Cassandra.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Description
Rewrites SSTables on a node that are incompatible with the current version. Use this command when
upgrading your server or changing compression options.
Examples
$ nodetool upgradesstables
Reads only SSTables created by old major versions of Cassandra and re-writes them to the current version
one at a time.
$ nodetool upgradesstables -a
Reads all existing SSTables and re-writes them to the current Cassandra version one at a time.
$ nodetool upgradesstables -a keyspace table
Reads specific SSTables and re-writes them to the current Cassandra version one at a time.
nodetool version
Provides the version number of Cassandra running on the specified node.
Synopsis
Synopsis Legend
• Angle brackets (< >) mean not literal, a variable
• Italics mean optional
• The pipe (|) symbol means OR or AND/OR
• Ellipsis (...) means repeatable
• Orange ( and ) means not literal, indicates scope
Usage:
Package installations:
$ sstableloader [options] path_to_keyspace
Tarball installations:
$ cd install_location/bin
$ sstableloader [options] path_to_keyspace
The sstableloader bulk loads the SSTables found in the keyspace directory to the configured target cluster,
where the parent directories of the directory path are used as the target keyspace/table name.
1. Go to the location of the SSTables:
$ cd /var/lib/cassandra/data/Keyspace1/Standard1/
2. To view the contents of the keyspace:
$ ls
Keyspace1-Standard1-jb-60-CRC.db
Keyspace1-Standard1-jb-60-Data.db
...
Keyspace1-Standard1-jb-60-TOC.txt
3. To bulk load the files, specify the path to Keyspace1/Standard1/ in the target cluster:
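A command of roughly this form performs the bulk load; the -d option names one or more initial hosts in the target cluster, and the address here is illustrative:
$ sstableloader -d 10.10.10.1 /var/lib/cassandra/data/Keyspace1/Standard1/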
The following cassandra.yaml options can be overridden from the command line:
Note: You can also use the cassandra-env.sh file to pass additional options, such as maximum
and minimum heap size, to the Java virtual machine rather than setting them in the environment.
The location of the cassandra.yaml file depends on the type of installation:
Usage
Add the following to the cassandra-env.sh file:
JVM_OPTS="$JVM_OPTS -D[PARAMETER]"
For Tarball installations, you can run this tool from the command line:
$ cassandra [OPTIONS]
Examples:
• cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
• Command line: bin/cassandra -Dcassandra.load_ring_state=false
The Example section contains more examples.
Option Description
-f Start the cassandra process in foreground. The default is to start as background
process.
-h Help.
-p filename Log the process ID in the named file. Useful for stopping Cassandra by killing its PID.
-v Print the version and exit.
Start-up parameters
The -D option specifies the start-up parameters in both the command line and cassandra-env.sh file.
cassandra.auto_bootstrap=false
Facilitates setting auto_bootstrap to false on initial set-up of the cluster. The next time you start the cluster,
you do not need to change the cassandra.yaml file on each node to revert to true.
cassandra.available_processors=number_of_processors
In a multi-instance deployment, multiple Cassandra instances independently assume that all CPU
processors are available to them. This setting allows you to specify a smaller set of processors.
cassandra.boot_without_jna=true
If JNA fails to initialize, Cassandra fails to boot. Use this option to boot Cassandra without JNA.
cassandra.config=directory
The directory location of the cassandra.yaml file. The default location depends on the type of installation.
cassandra.initial_token=token
Use when virtual nodes (vnodes) are not used. Sets the initial partitioner token for a node the first time the
node is started. (Default: disabled)
Note: Vnodes are highly recommended as they automatically select tokens.
cassandra.join_ring=true|false
Set to false to start Cassandra on a node but not have the node join the cluster. (Default: true) You can use
nodetool join and a JMX call to join the ring afterwards.
cassandra.load_ring_state=true|false
Set to false to clear all gossip state for the node on restart. (Default: true)
cassandra.metricsReporterConfigFile=file
Enable pluggable metrics reporter. See Pluggable metrics reporting in Cassandra 2.0.2.
cassandra.native_transport_port=port
Set the port on which the CQL native transport listens for clients. (Default: 9042)
cassandra.partitioner=partitioner
Set the partitioner. (Default: org.apache.cassandra.dht.Murmur3Partitioner)
cassandra.replace_address=listen_address or broadcast_address of dead node
To replace a node that has died, restart a new node in its place specifying the listen_address or
broadcast_address that the new node is assuming. The new node must not have any data in its data
directory, that is, it must be in the same state as before bootstrapping.
Note: The broadcast_address defaults to the listen_address except when using the
Ec2MultiRegionSnitch.
cassandra.replayList=table
Allow restoring specific tables from an archived commit log.
cassandra.ring_delay_ms=ms
Defines the amount of time a node waits to hear from other nodes before formally joining the ring. (Default:
1000ms)
cassandra.rpc_port=port
Set the port for the Thrift RPC service, which is used for client connections. (Default: 9160).
cassandra.ssl_storage_port=port
Set the SSL port for encrypted communication. (Default: 7001)
cassandra.start_native_transport=true|false
Enable or disable the native transport server. See start_native_transport in cassandra.yaml. (Default:
true)
cassandra.start_rpc=true|false
Enable or disable the Thrift RPC server. (Default: true)
cassandra.storage_port=port
Set the port for inter-node communication. (Default: 7000)
cassandra.triggers_dir=directory
Set the default location for the trigger JARs. (Default: conf/triggers)
cassandra.write_survey=true
For testing new compaction and compression strategies. It allows you to experiment with different strategies
and benchmark write performance differences without affecting the production workload. See Testing
compaction and compression.
cassandra.consistent.rangemovement=true|false
True makes Cassandra 2.1 bootstrapping behavior effective. False makes Cassandra 2.0 behavior effective.
Example
Clear gossip state when starting a node:
• Command line: bin/cassandra -Dcassandra.load_ring_state=false
• cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
Start Cassandra on a node and do not join the cluster:
• Command line: bin/cassandra -Dcassandra.join_ring=false
• cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
Replacing a dead node:
• Command line: bin/cassandra -Dcassandra.replace_address=10.91.176.160
• cassandra-env.sh: JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.91.176.160"
Command Description
read Multiple concurrent reads. The cluster must first be populated by a write test.
write Multiple concurrent writes against the cluster.
mixed Interleave basic commands with configurable ratio and distribution. The cluster must first
be populated by a write test.
counter_write Multiple concurrent updates of counters.
counter_read Multiple concurrent reads of counters. The cluster must first be populated by a
counter_write test.
user Interleave user provided queries with configurable ratio and distribution.
help Display help for a command or option.
Display help for an option: cassandra-stress help [options] For example:
cassandra-stress help -schema
Important: Additional sub options are available for each option in the following table. Format:
Option Description
-pop Population distribution and intra-partition visit order.
Usage
$ -pop seq=? [no-wrap] [read-lookback=DIST(?)] [contents=?]
or
-pop [dist=DIST(?)] [contents=?]
-insert Insert specific options relating to various methods for batching and splitting partition updates.
Usage
$ -insert [revisit=DIST(?)] [visits=DIST(?)] partitions=DIST(?)
[batchtype=?] select-ratio=DIST(?)
-col Column details, such as size and count distribution, data generator, names, and comparator.
Usage
$ -col [n=DIST(?)] [slice] [super=?] [comparator=?] [timestamp=?]
[size=DIST(?)]
Option Description
$ -rate [threads>=?] [threads<=?] [auto]
"C3" blob,
"C4" blob,
PRIMARY KEY (key)
) WITH
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
default_time_to_live=0 AND
speculative_retry='99.0PERCENTILE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
#On Node1
$ cassandra-stress write n=1000000 cl=one -mode native cql3 -schema
keyspace="keyspace1" -pop seq=1..1000000 -log file=~/node1_load.log -node
$NODES
#On Node2
$ cassandra-stress write n=1000000 cl=one -mode native cql3 -schema
keyspace="keyspace1" -pop seq=1000001..2000000 -log file=~/node2_load.log -
node $NODES
Note that the keyspace is defined and the -pop option tells each instance which range of keys to populate.
During stress testing, you can keep the daemon running and send commands to it using the --send-to
option.
Example
• Insert 1,000,000 rows to given host:
/tools/bin/cassandra-stress -d 192.168.1.101
When the number of rows is not specified, one million rows are inserted.
• Read 1,000,000 rows from given host:
Results:
op rate : 3954
partition rate : 3954
row rate : 3954
latency mean : 1.0
latency median : 0.8
latency 95th percentile : 1.5
latency 99th percentile : 1.8
latency 99.9th percentile : 2.2
latency max : 73.6
total gc count : 25
total gc mb : 1826
total gc time (s) : 1
avg gc time(ms) : 37
stdev gc time(ms) : 10
Total operation time : 00:00:59
Sleeping for 15s
Running with 4 threadCount
Data Description
total ops Running total number of operations during the run.
op/s Number of operations per second performed during the run.
pk/s Number of partition operations per second performed during the run.
row/s Number of row operations per second performed during the run.
mean Average latency in milliseconds for each operation during that run.
med Median latency in milliseconds for each operation during that run.
.95 95% of the time the latency was less than the number displayed in the column.
.99 99% of the time the latency was less than the number displayed in the column.
.999 99.9% of the time the latency was less than the number displayed in the column.
max Maximum latency in milliseconds.
time Total operation time.
stderr Standard error of the mean. It is a measure of confidence in the average throughput number; the smaller the number, the more accurate the measure of the cluster's performance.
gc: # Number of garbage collections.
max ms Longest garbage collection in milliseconds.
sum ms Total of garbage collection in milliseconds.
sdv ms Standard deviation in milliseconds.
mb Size of the garbage collection in megabytes.
Procedure
1. Before using sstablescrub, try rebuilding the tables using nodetool scrub.
If nodetool scrub does not fix the problem, use this utility.
2. Shut down the node.
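3. Run the utility against the affected table, for example (the keyspace and table names are placeholders):
$ sstablescrub Keyspace1 Standard1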
$ sstablesplit -s 40 /var/lib/cassandra/data/Keyspace1/Standard1/*
$ bin/sstablekeys <sstable_name>
Procedure
1. Create the playlists table in the music keyspace as shown in Data modeling.
2. Insert the row of data about ZZ Top in playlists:
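A sketch of such an insert, with illustrative UUIDs and the playlists columns from the data modeling example:
INSERT INTO music.playlists (id, song_order, song_id, title, album, artist)
VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 1, a3e64f8f-bd44-4f28-b8d9-6938726e34d4, 'La Grange', 'Tres Hombres', 'ZZ Top');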
Usage:
• Package installations: sstableupgrade [options] <keyspace> <cf> [snapshot]
• Tarball installations: install_location/bin/sstableupgrade [options] <keyspace>
<cf> [snapshot]
The snapshot option only upgrades the specified snapshot.
References
Reference topics.
Procedure
You must have root or sudo permissions to start Cassandra as a service.
On initial start-up, each node must be started one at a time, starting with your seed nodes:
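For packaged installations, the service is typically started with:
$ sudo service cassandra start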
Procedure
On initial start-up, each node must be started one at a time, starting with your seed nodes.
• To start Cassandra in the background:
$ cd install_location
$ bin/cassandra
• To start Cassandra in the foreground:
$ cd install_location
$ bin/cassandra -f
Procedure
You must have root or sudo permissions to stop the Cassandra service:
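For packaged installations, the service is typically stopped with:
$ sudo service cassandra stop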
Procedure
Find the Cassandra Java process ID (PID), and then kill the process using its PID number:
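For tarball installations, a typical sequence is:
$ ps auwx | grep cassandra
$ sudo kill <pid>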
Procedure
To clear the data from the default directories:
After stopping the service, run the following command:
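For packaged installations the default directories live under /var/lib/cassandra, so a typical command is:
$ sudo rm -rf /var/lib/cassandra/*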
Procedure
To clear all data from the default directories, including the commitlog and saved_caches:
1. After stopping the process, run the following command from the install directory:
$ cd install_location
$ sudo rm -rf data/*
2. To clear only the data directory:
$ cd install_location
$ sudo rm -rf data/data/*
Install locations
Install location topics.
Directories  Description
data         Files for commitlog, data, and saved_caches (unless set in cassandra.yaml)
bin          Utilities and start scripts
conf         Configuration files and environment settings
interface    Thrift and Avro client APIs
javadoc      Cassandra Java API documentation
lib          JAR and license files
tools        Cassandra tools and sample cassandra.yaml files for stress testing.
For DataStax Enterprise installs, see the documentation for your DataStax Enterprise version.
Directories                        Description
/var/lib/cassandra                 Data directories
/var/log/cassandra                 Log directory
/var/run/cassandra                 Runtime files
/usr/share/cassandra               Environment settings
/usr/share/cassandra/lib           JAR files
/usr/bin                           Optional utilities, such as sstablelevelreset, sstablerepairedset, and sstablesplit
/usr/bin, /usr/sbin                Binary files
/etc/cassandra                     Configuration files
/etc/init.d                        Service startup script
/etc/security/limits.d             Cassandra user limits
/etc/default
/usr/share/doc/cassandra/examples  Sample cassandra.yaml files for stress testing.
For DataStax Enterprise installs, see the documentation for your DataStax Enterprise version.
Keyspace attributes
Cassandra stores storage configuration attributes in the system keyspace. You can set storage engine
configuration attributes on a per-keyspace or per-table basis on the command line using the Cassandra-
CLI utility. A keyspace must have a user-defined name, a replica placement strategy, and options that
specify the number of copies per data center or node.
name
Required. The name for the keyspace.
placement_strategy
Required. Determines how Cassandra distributes replicas for a keyspace among nodes in the ring. Values
are:
• SimpleStrategy or org.apache.cassandra.locator.SimpleStrategy
• NetworkTopologyStrategy or
org.apache.cassandra.locator.NetworkTopologyStrategy
NetworkTopologyStrategy requires a snitch to be able to determine rack and data center locations of a node.
For more information about replication placement strategy, see Data replication.
strategy_options
Specifies configuration options for the chosen replication strategy class. The replication factor option is the
total number of replicas across the cluster. A replication factor of 1 means that there is only one copy of
each row on one node. A replication factor of 2 means there are two copies of each row, where each copy
is on a different node. All replicas are equally important; there is no primary or master replica. As a general
rule, the replication factor should not exceed the number of nodes in the cluster. However, you can increase
the replication factor and then add the desired number of nodes.
When the replication factor exceeds the number of nodes, writes are rejected, but reads are served as long
as the desired consistency level can be met.
For more information about configuring the replication placement strategy for a cluster and data centers,
see Choosing keyspace replication options.
durable_writes
(Default: true) When set to false, data written to the keyspace bypasses the commit log. Be careful using
this option because you risk losing data.
Table attributes
Attributes per table.
The following attributes can be declared per table.
bloom_filter_fp_chance
See CQL properties in CQL for Cassandra 2.x.
bucket_high
See CQL Compaction Subproperties in CQL for Cassandra 2.x.
bucket_low
See CQL Compaction Subproperties in CQL for Cassandra 2.x.
caching
See CQL properties in CQL for Cassandra 2.x.
chunk_length_kb
See CQL Compression Subproperties in CQL for Cassandra 2.x.
column_metadata
(Default: N/A - container attribute) Column metadata defines these attributes of a column:
• name: Binds a validation_class and (optionally) an index to a column.
• validation_class: Type used to check the column value.
• index_name: Name of the index.
• index_type: Type of index. Currently the only supported value is KEYS.
Setting a value for the name option is required. The validation_class is set to the default_validation_class
of the table if you do not set the validation_class option explicitly. The value of index_type must be set to
create an index for a column. The value of index_name is not valid unless index_type is also set.
Setting and updating column metadata with the Cassandra-CLI utility requires a slightly different command
syntax than other attributes; note the brackets and curly braces in this example:
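A sketch of the syntax, using hypothetical table and column names:
[default@demo] UPDATE COLUMN FAMILY users WITH comparator=UTF8Type
AND column_metadata=[{column_name: full_name, validation_class: UTF8Type, index_type: KEYS}];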
column_type
(Default: Standard) The standard type of table contains regular columns.
comment
See CQL properties in CQL for Cassandra 2.x.
compaction_strategy
See compaction in CQL properties in CQL for Cassandra 2.x.
compaction_strategy_options
(Default: N/A - container attribute) Sets attributes related to the chosen compaction-strategy. Attributes are:
• bucket_high
• bucket_low
• max_compaction_threshold
• min_compaction_threshold
• min_sstable_size
• sstable_size_in_mb
• tombstone_compaction_interval
• tombstone_threshold
comparator
(Default: BytesType) Defines the data types used to validate and sort column names. There are several
built-in column comparators available. The comparator cannot be changed after you create a table.
compression_options
(Default: N/A - container attribute) Sets the compression algorithm and sub-properties for the table. Choices
are:
• sstable_compression
• chunk_length_kb
• crc_check_chance
crc_check_chance
See CQL Compression Subproperties in CQL for Cassandra 2.x.
default_time_to_live
See CQL properties in CQL for Cassandra 2.x.
default_validation_class
(Default: N/A) Defines the data type used to validate column values. There are several built-in column
validators available.
gc_grace
See CQL properties in CQL for Cassandra 2.x.
index_interval
See CQL properties in CQL for Cassandra 2.x.
key_validation_class
(Default: N/A) Defines the data type used to validate row key values. There are several built-in key validators
available, however CounterColumnType (distributed counters) cannot be used as a row key validator.
max_compaction_threshold
See max_threshold in CQL Compaction Subproperties in CQL for Cassandra 2.x.
min_compaction_threshold
See min_threshold in CQL Compaction Subproperties in CQL for Cassandra 2.x.
max_index_interval
See CQL properties in CQL for Cassandra 2.x.
min_index_interval
See CQL properties in CQL for Cassandra 2.x.
memtable_flush_after_mins
Deprecated as of Cassandra 1.0, but can still be declared for backward compatibility. Use
commitlog_total_space_in_mb.
memtable_flush_period_in_ms
See CQL properties in CQL for Cassandra 2.x.
memtable_operations_in_millions
Deprecated as of Cassandra 1.0, but can still be declared for backward compatibility. Use
commitlog_total_space_in_mb.
memtable_throughput_in_mb
Deprecated as of Cassandra 1.0, but can still be declared for backward compatibility. Use
commitlog_total_space_in_mb.
min_sstable_size
See CQL Compaction Subproperties in CQL for Cassandra 2.x.
name
(Default: N/A) Required. The user-defined name of the table.
read_repair_chance
See CQL properties in CQL for Cassandra 2.x.
speculative_retry
See CQL properties in CQL for Cassandra 2.x.
sstable_size_in_mb
See CQL Compaction Subproperties in CQL for Cassandra 2.x.
sstable_compression
See compression in CQL properties in CQL for Cassandra 2.x.
tombstone_compaction_interval
See CQL Compaction Subproperties in CQL for Cassandra 2.x.
tombstone_threshold
See CQL Compaction Subproperties in CQL for Cassandra 2.x.
Moving data to or from other databases
ETL Tools
If you need more sophistication applied to a data movement situation (more than just extract-load), then
you can use any number of extract-transform-load (ETL) solutions that now support Cassandra. These
tools provide excellent transformation routines that allow you to manipulate source data in literally any way
you need and then load it into a Cassandra target. They also supply many other features such as visual,
point-and-click interfaces, scheduling engines, and more.
Many ETL vendors who support Cassandra supply community editions of their products that are free
and able to solve many different use cases. Enterprise editions are also available that supply many other
compelling features that serious enterprise data users need.
You can freely download and try ETL tools from Jaspersoft, Pentaho, and Talend that all work with
community Cassandra.
Troubleshooting
DiskAccessMode can be switched to mmap_index_only, which as the name implies will only mmap the
indices, using much less address space.
DataStax strongly recommends that you disable swap entirely (sudo swapoff --all). Because
Cassandra has multiple replicas and transparent failover, it is preferable for a replica to be killed
immediately when memory is low rather than go into swap. This allows traffic to be immediately redirected
to a functioning replica instead of continuing to hit the replica that has high latency due to swapping. If
your system has a lot of DRAM, swapping still lowers performance significantly because the OS swaps
out executable code so that more DRAM is available for caching disks. To make this change permanent,
remove all swap file entries from /etc/fstab.
If you insist on using swap, you can set vm.swappiness=1. This allows the kernel to swap out only the
absolute least used parts.
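For example, to apply the setting immediately and persist it across reboots:
$ sudo sysctl -w vm.swappiness=1
$ echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf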
If the GCInspector isn't reporting very long GC times, but is reporting moderate times frequently
(ConcurrentMarkSweep taking a few seconds very often) then it is likely that the JVM is experiencing
extreme GC pressure and will eventually OOM. See the section below on OOM errors.
If you still cannot run nodetool commands remotely after making this configuration change, do a full
evaluation of your firewall and network security. The nodetool utility communicates through JMX on port
7199.
Procedure
1. Run the nodetool describecluster command.
$ bin/nodetool describecluster
If any node is UNREACHABLE, you see output something like this:
$ bin/nodetool describecluster
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
UNREACHABLE: 1176b7ac-8993-395d-85fd-41b89ef49fbb: [10.202.205.203]
9b861925-1a19-057c-ff70-779273e95aa6: [10.80.207.102]
8613985e-c49e-b8f7-57ae-6439e879bb2a: [10.116.138.23]
2. Restart unreachable nodes.
3. Repeat steps 1 and 2 until nodetool describecluster shows that all nodes have the same
schema version number; only one schema version appears in the output.
Java reports an error saying there are too many open files
Java may not have enough open file descriptors.
Cassandra generally needs more than the default (1024) amount of file descriptors. To increase the
number of file descriptors, change the security limits on your Cassandra nodes as described in the
Recommended Settings section of Insufficient user resource limits errors.
Another, much less likely, possibility is a file descriptor leak in Cassandra. Run lsof -n | grep java
to check that the number of file descriptors opened by Java is reasonable, and report the problem if the
number is greater than a few thousand.
Cassandra errors
Insufficient as (address space) or memlock setting
Recommended settings
You can view the current limits using the ulimit -a command. Although limits can also be temporarily
set using this command, DataStax recommends making the changes permanent:
Packaged installs: Ensure that the following settings are included in the /etc/security/limits.d/
cassandra.conf file:
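The packaged file typically contains entries for the cassandra user of the following form:
cassandra - memlock unlimited
cassandra - nofile 100000
cassandra - nproc 32768
cassandra - as unlimited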
Tarball installs: Ensure that the following settings are included in the /etc/security/limits.conf file:
* - memlock unlimited
* - nofile 100000
* - nproc 32768
* - as unlimited
If you run Cassandra as root, some Linux distributions, such as Ubuntu, require setting the limits for root
explicitly instead of using *:
For CentOS, RHEL, OEL systems, also set the nproc limits in /etc/security/limits.d/90-
nproc.conf:
* - nproc 32768
For all installations, add the following line to /etc/sysctl.conf:
vm.max_map_count = 131072
To make the changes take effect, reboot the server or run the following command:
$ sudo sysctl -p
To confirm the limits are applied to the Cassandra process, run the following command where pid is the
process ID of the currently running Cassandra process:
$ cat /proc/<pid>/limits
OpsCenter errors
See the OpsCenter Troubleshooting documentation.
Procedure
To prevent connections between nodes from timing out, set the TCP keep alive variables:
1. Get a list of available kernel variables:
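One way to list the relevant variables:
$ sudo sysctl -A | grep net.ipv4.tcp_keepalive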
DataStax Community release notes
JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"
• Tarball installations: Start Cassandra with this option:
$ bin/cassandra -Dcassandra.consistent.rangemovement=false
To replace a dead node, you also need to specify the address of the node from which Cassandra
streams the data.
For a complete list of fixes and new features, see the Apache Cassandra 2.1.0 CHANGES.txt. You can
view all version changes by branch or tag in the branch drop-down list:
Tips for using DataStax documentation
Other resources
You can find more information and help at:
• Documentation home page
• Datasheets
• Webinars
• Whitepapers
• Developer blogs
• Support