Want your very own server? Get our 1GB memory, Xeon V4, 25GB SSD VPS for £10.00 / month.

Overview

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. You can deploy a Cassandra cluster in a single datacenter or span it across multiple datacenters.

Apache Cassandra was initially developed at Facebook by Avinash Lakshman and Prashant Malik to power the Facebook Inbox Search feature. Lakshman was also one of the authors of Amazon’s Dynamo paper.

Cassandra was released as an open source project in July 2008 on Google Code. In March 2009 it became an Apache Incubator project, and it graduated to an Apache top-level project on February 17, 2010.

The Cassandra name was taken from the Greek mythological prophet.

Apache Cassandra is currently licensed under the Apache License, version 2.0.

Cassandra’s key features include:

Massively scalable architecture – a masterless design where all nodes are the same, which provides operational simplicity and easy scale-out.
Active everywhere design – all nodes may be written to and read from.
Linear scale performance – the ability to add nodes without going down produces predictable increases in performance.
Continuous availability – offers redundancy of both data and node function, which eliminates single points of failure and provides constant uptime.
Transparent fault detection and recovery – nodes that fail can easily be restored or replaced.
Flexible and dynamic data model – supports modern data types with fast writes and reads.
Strong data protection – a commit log design ensures no data loss, and built-in security with backup/restore keeps data protected and safe.
Tunable data consistency – support for strong or eventual data consistency across a widely distributed cluster.
Multi-data center replication – cross data center (in multiple geographies) and multi-cloud availability zone support for writes/reads.
Data compression – data compressed up to 80% without performance overhead.
CQL (Cassandra Query Language) – an SQL-like language that makes moving from a relational database very easy.

You can see the references section at the end of this article to learn more about the basics and the detail of Cassandra.

Update Base System

Before we install any prerequisites and Cassandra, let’s bring the system up to date. Run the commands below to fetch and install the latest updates.

$ sudo apt-get update
$ sudo apt-get upgrade

Install JDK 8

Apache Cassandra runs on top of the Java Virtual Machine (JVM). We’ll install Oracle JDK 8 on the system before we install Apache Cassandra. Apache Cassandra can also run on OpenJDK, IBM JVM and Azul Zing JVM.

We will install Oracle JDK using the webupd8team PPA repository.

First, add the webupd8team PPA repository:

$ sudo add-apt-repository ppa:webupd8team/java
...
Press [ENTER] to continue or ctrl-c to cancel adding it

...
OK

You need to press Enter to continue adding the webupd8team PPA repository. The output above has been truncated to show only the most important part.

Let apt-get download and read the metadata of the webupd8team repository:

$ sudo apt-get update

Install JDK 8.

$ sudo apt-get -y install oracle-java8-installer

The -y option answers yes automatically to the installation prompt, so the package and its dependencies are installed without asking for confirmation. If you want to review what will be installed first, remove the -y option.

On the package configuration screen, choose OK.

When asked to accept the Oracle Binary Code License terms, choose Yes.

After installing Java 8, you can check the current Java version by running the command below:

$ java -version
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
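
If you want a script to confirm the major version instead of reading the output by eye, the version string can be parsed. The sketch below is only an illustration: java_major_version is a hypothetical helper name, and it assumes the output contains a line shaped like the java version "1.8.0_66" line shown above.

```shell
# Hypothetical helper (not part of Java or Cassandra): reads the output of
# `java -version` on stdin and prints the major version number.
# Assumes a line such as: java version "1.8.0_66"
java_major_version() {
    awk -F'"' '/version/ {
        split($2, parts, /[.]/)
        # The legacy scheme "1.8.0_66" means Java 8; newer schemes lead with the major number.
        if (parts[1] == "1") print parts[2]; else print parts[1]
        exit
    }'
}
```

Note that java -version writes to standard error, so redirect it when piping: java -version 2>&1 | java_major_version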

Install Apache Cassandra

Now that the system is up to date and Oracle JDK 8 is installed, we can start installing Apache Cassandra.

First of all, let’s add the DataStax repository key.

$ curl -L http://debian.datastax.com/debian/repo_key | sudo apt-key add -

Add the DataStax Cassandra repository to a new apt sources list file.

$ echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
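
Because tee -a appends, running the command above twice leaves a duplicate entry in the file. A small guard avoids that. This is a sketch under stated assumptions: add_apt_source_line is a hypothetical helper name, and it has to run from a shell that is allowed to write the target file.

```shell
# Hypothetical helper: append a line to an apt sources file only if that
# exact line is not already present.
#   $1 = the full "deb ..." line, $2 = path of the sources list file
add_apt_source_line() {
    # -x matches whole lines only, -F treats the text as a fixed string, -q stays silent.
    # A missing file simply counts as "line not present".
    if ! grep -qxF -- "$1" "$2" 2>/dev/null; then
        printf '%s\n' "$1" >> "$2"
    fi
}
```

From a root shell it could be called as: add_apt_source_line "deb http://debian.datastax.com/community stable main" /etc/apt/sources.list.d/cassandra.sources.list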

Make apt-get read the metadata of the Cassandra repository.

$ sudo apt-get update

Now let’s install Cassandra 2.2. This is the latest stable version of Cassandra at the time of writing. You can check the Planet Cassandra Download Page for information on the latest stable release of Apache Cassandra.

$ sudo apt-get install dsc22=2.2.3-1 cassandra=2.2.3

Optionally, we can also install the Cassandra utilities.

$ sudo apt-get install cassandra-tools=2.2.3

You can check the Cassandra service using the command below:

$ sudo service cassandra status
 * could not access pidfile for Cassandra

The status message above is misleading. The Cassandra process is running, but the init script reports that it could not access the Cassandra pidfile. This is due to a bug in the Cassandra init script, which we’ll fix in the next section.

We can also check the cluster status using the nodetool command:

$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns    Host ID                               Rack
UN  127.0.0.1  179.29 KB  256          ?       7cd1bdc4-8bfa-49d9-a453-e0cf83bf956f  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
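
In the output above, the first column combines the status and state letters from the legend, so UN means Up and Normal. For a scripted health check, those rows can be counted. This is a minimal sketch that assumes the column layout shown above; count_up_normal is a hypothetical helper name, not a nodetool feature.

```shell
# Hypothetical helper: reads `nodetool status` output on stdin and prints
# how many nodes are listed as UN (Up / Normal).
count_up_normal() {
    # n + 0 forces a numeric 0 when no UN row was seen.
    awk '$1 == "UN" { n++ } END { print n + 0 }'
}
```

Typical use would be: nodetool status | count_up_normal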

Let’s try connecting to the Cassandra server using cqlsh with the command below:

$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.2.3 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> quit

We won’t do anything yet, so just type quit at the cqlsh prompt.

Fixing Service Status

The init script on Ubuntu is broken when it is used to check the service status. You can confirm this by running the command below:

$ sudo service cassandra status
 * could not access pidfile for Cassandra

This bug has already been acknowledged by the Cassandra team. See the CASSANDRA-9822 issue or Issue 63 on GitHub.

The fix is pretty simple. Open /etc/init.d/cassandra (you need sudo to edit this file) and find the line below (it should be on line 60):

CMD_PATT="cassandra.+CassandraDaemon"

change it to:

CMD_PATT="cassandra"

After changing this line, checking the service status should return the correct status:

$ sudo service cassandra status
 * Cassandra is running
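
The one-line change to the init script described above can also be applied without opening an editor. The sketch below assumes the CMD_PATT line appears exactly as quoted earlier in this section; fix_cmd_patt is a hypothetical helper name, and the .bak backup is this example’s own choice rather than anything the Cassandra package requires.

```shell
# Hypothetical helper: replace the CMD_PATT line in a Cassandra init script.
#   $1 = path to the init script
# sed keeps a backup of the original next to it as <path>.bak.
fix_cmd_patt() {
    # In a basic regular expression "+" is literal and "." matches any
    # character, so this pattern matches the original line text as written.
    sed -i.bak 's/^CMD_PATT="cassandra.+CassandraDaemon"/CMD_PATT="cassandra"/' "$1"
}
```

From a root shell: fix_cmd_patt /etc/init.d/cassandra. Once the status check reports correctly, the /etc/init.d/cassandra.bak backup can be removed.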

Configure the Apache Cassandra Cluster Name

The default cluster name for Cassandra is Test Cluster. In this section, we’ll change the cluster name to something else. First we need to stop Cassandra and delete the existing system data, where the old cluster name is saved.

$ sudo service cassandra stop
$ sudo rm -rf /var/lib/cassandra/data/system/*

Edit the Cassandra configuration. Open /etc/cassandra/cassandra.yaml and find the line below:

cluster_name: 'Test Cluster'

Change it to the cluster name that you want to use. In this example we change the cluster name to HostPresto Cluster:

cluster_name: 'HostPresto Cluster'
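
The same edit can be scripted. This is a sketch, not an official tool: set_cluster_name is a hypothetical helper name, and it assumes the new name contains no slash, ampersand, backslash or single quote characters, since those would need escaping in the sed expression.

```shell
# Hypothetical helper: rewrite the cluster_name line in cassandra.yaml.
#   $1 = path to cassandra.yaml, $2 = new cluster name
# sed keeps a backup of the original next to it as <path>.bak.
set_cluster_name() {
    # Only the line that starts with "cluster_name:" is replaced.
    sed -i.bak "s/^cluster_name:.*/cluster_name: '$2'/" "$1"
}
```

From a root shell, with Cassandra stopped: set_cluster_name /etc/cassandra/cassandra.yaml "HostPresto Cluster"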

After changing the Cassandra cluster name, let’s start Cassandra and check the service status:

$ sudo service cassandra start
$ sudo service cassandra status
 * Cassandra is running

To confirm that the cluster name has changed, let’s connect again using cqlsh:

$ cqlsh
Connected to HostPresto Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 2.2.3 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh>

We can see above that cqlsh is now connected to HostPresto Cluster.

Using Apache Cassandra

Let’s try our Cassandra installation by creating a movie database. First, let’s create a keyspace, which is a namespace for tables. The keyspace name below is moviedb.

cqlsh> CREATE KEYSPACE moviedb
   ... WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Use the keyspace that we just created.

cqlsh> use moviedb;

Create the movies table.

cqlsh:moviedb> CREATE TABLE movies (
           ...  id int PRIMARY KEY,
           ...  title text,
           ...  year text
           ... );

Let’s describe the table that we just created:

cqlsh:moviedb> DESC movies;

CREATE TABLE moviedb.movies (
    id int PRIMARY KEY,
    title text,
    year text
) WITH bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CREATE INDEX movies_title_idx ON moviedb.movies (title);

The table is ready; now it’s time to add some data.

cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (1, 'Birdman','2014');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (2, '12 Years a Slave','2013');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (3, 'Argo','2012');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (4, 'The Artist','2011');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (5, 'The King''s Speech','2010');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (6, 'The Hurt Locker','2009');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (7, 'Slumdog Millionaire','2008');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (8, 'No Country for Old Men','2007');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (9, 'The Departed','2006');
cqlsh:moviedb> INSERT INTO movies (id,title,year) VALUES (10, 'Crash','2005');

Let’s see everything that we added to the table using SELECT.

cqlsh:moviedb> SELECT * from movies;

 id | title                  | year
----+------------------------+------
  5 |      The King's Speech | 2010
 10 |                  Crash | 2005
  1 |                Birdman | 2014
  8 | No Country for Old Men | 2007
  2 |       12 Years a Slave | 2013
  4 |             The Artist | 2011
  7 |    Slumdog Millionaire | 2008
  6 |        The Hurt Locker | 2009
  9 |           The Departed | 2006
  3 |                   Argo | 2012

(10 rows)

We need to create an index on the title column so we can search by title.

cqlsh:moviedb> CREATE INDEX on movies (title);
cqlsh:moviedb> SELECT * FROM movies WHERE title = 'Argo';

 id | title | year
----+-------+------
  3 |  Argo | 2012

(1 rows)
cqlsh:moviedb>

That’s it, the basic usage of Apache Cassandra.

Try Cassandra Online

If you want to try Cassandra first without installing Cassandra on your computer or server you can use Try Cassandra.

References

Summary

In this tutorial we learned how to install Apache Cassandra from the DataStax repository, do some basic configuration, and run through the basic usage of Apache Cassandra. Now you can start exploring this single-node installation of Apache Cassandra. Have fun!
