Overcoming Cassandra write performance problems

Cassandra became a household name for the projects involving processing massive amounts of data in NoSQL environment. Cassandra, while shining at supporting the biggest distributed clusters installations, “this comes at the price of high write and read latencies.”

In a recent scenario faced by our customer, Cassandra latencies were increasing dramatically during peak hours and clients started receiving timeout exceptions. Bringing bigger hummer, which for Cassandra means adding more nodes, did not really help.

In the ensuing drama, developers blamed operations team and vice versa:

developers complained that “operations team have misconfigured the production Cassandra cluster”
while operations team were claiming that “there are no obvious issues as CPU utilization is below 60% across all cores”

That was the moment the our experts specializing in Cassandra were requested to intervene. Here comes their story.

Cassandra Environment

Cassandra 3.0.14 cluster deployed on AWS r4.2xlarge instances throughout three Availability Zonesusing the AWS-recommended Ec2Snitch, with each AZ mapped to a rack in Cassandra.

Research

First Glance

All problem were related only to counter tables.
All problems were related to writes.
COUNTER_MUTATION was increasing during peak hours only – which indicates that Cassandra is not able to keep up in general.

Utilizing Request Tracing

We started from evaluating each stage of the write process.

Cassandra comes with a feature called request tracing. This a very useful troubleshooting tool because Cassandra write path is quite complicated.

Tracing has shown that all node-local writes complete within a 200 sec (thanks for to powerful AWS machines), but it also demonstrated that communication between request coordinator and other nodes is inconsistent.

Thanks to Ec2Snitch, writes went to different AZs. AWS does a really good job in terms of cross-AZ latency, but still it varies (in our case the worst latency between two AZs was 600 sec).

An obvious suggestion would be to ask developers to reduce the consistency level, but since all the tables involved were counter tables this was not really an option (RF=3) – read on why you cannot specify consistency level = ANY.

Checking Networking

Since we saw both cross-AZ communication between the nodes and different latency between different AZs, it was quite straightforward to first check all network-related configuration.

Cassandra is a very network-intensive application – and the fact that these AWS machines were not using ENI, made it an immediate suspect.

Even when bandwidth is not a limitation, network I/O induced latency may still appear – non-ENI driver quickly making network a bottleneck. There are many reasons why enabling ENI is extremely important, the biggest one is the subject of multiple RX/TX queues for the OS to deal with. Enabling ENI allows to leverage multiple CPUs to handle network transmission and reduce internal locking contention.

Once all machines were updated with ENI interfaces, the peak hours latency spike halved and RPS threshold at which latency started to build up increased by 35%.

Not Enough

However COUNTER_MUTATION was still increasing during peak hours causing client timeouts.

Cassandra network coalescing allows to reduce overall number of packages sent across network at the cost of increasing latency – however with reasonable queuing this can help with overall latency in case network stack is an issue. Fine-tuning it for higher latency cost has increased our peak RPS by 5%.

Good, but still not enough. Now we wanted to have a clear answer on whether network I/O is related more to the OS or to Cassandra. Even though that was not a fully comprehensive test (our goal mainly was to make sure that ENI driver works well – we did not bother on how much did kernel spend on queuing or iptables) there were a few handy sources of information to look at.

For incoming traffic – /proc/net/softnet_stat, specifically values 2 and 3 (dropped and time_squeeze) can give a hint whether incoming packet processing is starving.
For outgoing traffic – /sys/class/net/NIC/queues/tx-QUEUE_NUMBER/byte_queue_limits can be useful to check how transmit queues are being utilized. Great explanation here and here.

Conclusion: networking stack with ENI driver showed no latency problems during peak hours.

By that time we already increased peak RPS by enabling ENI and fine-tuning network coalescing feature. Now we wanted to know what else can be done to increase peak RPS even further.

Next step: with Cassandra employing SEDA architecture it is very useful to look at thread pool statistics – for each thread pool there are stats of active, pending and blocked tasks available via JMX.

In our case during peak traffic times we had spikes in number of blocked native transport requests.

There is a worthy discussion on that matter, but the key point for us was that we hit the limits of native transport threads (128 by default) and thread queues (1024). Which meant nodes coordinator simply stopped accepting requests and Cassandra reported blocked events on native transport requests thread pool.

We already excluded networking stack associated latency (potentially overloading native transport requests pool) – so we hit the wall.

Catching the Culprit

CPU is always a suspect, but here was no smoking gun: the CPU utilization across all cores was about at 60-70% during peak traffic times.

We decided to check an idea that with Cassandra SEDA architecture and multiple thread pools there may be too many threads contending for available 8 cores.

To prove that point we used perf sched. Originally this tool was developed to help tracing thread scheduling delays which could negatively impact UI responsiveness. But in our case this was a perfect tool as we could collect timing for threads waiting to be scheduled.

Bingo – during peak hours internal Cassandra thread contention was becoming a problem. More importantly, it was not immediately visible in CPU utilization was because these spikes happened in bursts, and tools like top just were not precise enough to catch CPU contention. But with captured trace of thread scheduling delays we had a perfect proof the CPU being a bottleneck.

Happy End

Once the problem was explained, the decision was to move in two steps:

Task for operations team – immediately upgrade to AWS instances with higher number of cores.
Task for developers – code change. Specifically, redesign schema because additionally, due to poor partition key choice at each point in time there were “hot” nodes received most of the write traffic while other nodes were idling.