Cassandra is a distributed NoSQL database in simple terms. By distributed, it means Cassandra has multiple inter connected nodes responsible for storing and retrieving the data. Multiple nodes are required to achieve availability, scalability and high performance which Cassandra is famous for.
When its comes to distributed systems, we need to understand CAP(Consistency, Availability & Partition Tolerance) theorem. According to it “any distributed data store can only provide two of the these three(i.e. CAP) guarantees”. Since network failures are bound to happen in distributed systems we need to take care of “Partition Tolerance” for sure keeping the multiple copies of data across nodes configured by replication factor. So there are 2 choices, either give preference to availability or consistency when network failure occurs.
In general, Cassandra is mentioned as AP(Availability & Partition tolerance) database since thats the Cassandra is the best fit for. However higher consistency is tunable sacrificing the availability and performance in Cassandra for the some of the use cases where it is required but thats not Cassandra is a good fit for.
Basic Terms used in Cassandra
- Node : A server where Cassandra software runs on and subset of data is stored.
- Cluster : Collection of many nodes serving as one Cassandra database.
- Column : Smallest storage unit in Cassandra to store data field’s key and value, known as Cell as well.
- Column family : Collection of columns for a particular use case. It’s same as Table.
- Partition : It’s a subset of data on a particular node for a table. A node can have multiple partitions.
- Partition Key : Set of columns for identifying the placement of data in a particular partition of a node.
- Clustering Key : Set of columns for storing the multiple records in sorted order.
- Row Key : Primary key to uniquely identify a record in database. Primary key = Partition Key + Clustering Key
- Keyspace : Storage unit for consisting multiple column families similar to a schema in a relational database
It’s possible not to have any clustering key columns having row/primary key same as partition key. Though primary purpose of partition key is not to work as primary key but to distribution the data across different nodes in different small partitions. However partition key should definitely not have low cardinality and whether you should have very high cardinality or not depends on your use cases of querying the data and data size.
Following diagram will make it more clear when we have clustering key(here its a timestamp field) :
What Cassandra is good at
- Highly available : Cassandra doesn’t have single point of failure i.e. always available for read and write given majority of nodes are still available and connected. This is achievable because there is no master-slave topology in C*, all nodes are same(of course data on those nodes can be different with replication on few nodes in the cluster). So even if you don’t have strong consistency, scalability and performance requirements and just want to have very highly available database across different regions, Cassandra is a good choice. Thats why Kong(API Gateway) also uses it as one of the database choices.
- Horizontally scalable and performant for write load : This is achievable because we can have multiple nodes to handle the write loads for different partitions and also within a single node, Cassandra writes new data to a file called commitlog in append only fashion which makes writes to disk faster(same as Kafka). Other than this Cassandra uses memtables to store new data in memory and then flushes the data from memory to data files called SSTable periodically. So commitlogs are used for node failure scenarios only.
- Horizontally scalable for data size : This is achievable because we can add more nodes in the cluster to handle more data.
- Horizontally scalable for read load(conditional) : Querying data from Cassandra based on single partition key is fast and scalable but there are some factors which may impact the performance and scalability. For example if there is too much data in a single partition or data is being updated for same partition key causing data to reside in different data files(SSTables) for same partition. Though there are different SSTable compaction strategies to minimize the impact but things need to be tested carefully depending on exact use cases along with data modeling which is the most important part while using Cassandra database.
What Cassandra is not good at or does not support it at all
- ACID(Atomicity, Consistency, Isolation, Durability) compliance : Not supported which are required for applications like financial systems.
- Aggregation functions : Cassandra is not a good fit for aggregation like count, max, avg etc.
- Joins : Not supported, though data denormalization is a common practice in Cassandra to keep the related data for a query in one partition but that should be within limits.
- Sorting : Cassandra supports sorting only on clustering columns and you can’t not change the value of clustering column once inserted. So Sorting in Cassandra is a design decision.
- New use cases : We need to store data in Cassandra as per query patterns so if any new requirement comes, it’s not easily possible to handle it unless we store the data again as per Cassandra data modeling requirements.
Summary
In this post we learnt about basics of Cassandra, it’s use cases and where we need to be careful or avoid using Cassandra.
We will cover more advanced scenarios and use cases in upcoming posts, until then make sure to give your claps for this post and follow me to see more such posts in future.