On June 3, 2019, between 08:25 and 14:50 (UTC), some of our servers became unavailable, causing degraded performance in our log processing system. This resulted in a data lag that reached a maximum of 1.5 hours and gradually resolved over a 6.5-hour period.
During a routine Kafka partition reassignment, we encountered a disk I/O lock. The lock was caused by leader skew in our largest topics, which led us to miscalculate the throttling parameter we applied to the reassignment. Soon after the process began, we experienced a spike in data reads that left our service unable to write to Kafka, because our disk had exceeded its burst balance.
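For illustration, here is a minimal sketch (with entirely hypothetical throughput figures) of why a throttle sized for the average broker overshoots when leaders are skewed, and why sizing it against the busiest broker keeps every disk inside its budget:

```python
# Sketch: size a Kafka reassignment throttle so the most leader-skewed
# broker stays inside its disk budget. All figures are hypothetical.

DISK_BYTES_SEC = 250 * 1024 * 1024  # sustained disk throughput per broker (assumed)

# Steady-state traffic per broker, in bytes/sec. broker-1 leads most
# partitions of the largest topics, so it carries far more load.
steady_load = {
    "broker-1": 200 * 1024 * 1024,
    "broker-2": 80 * 1024 * 1024,
    "broker-3": 80 * 1024 * 1024,
}

# A throttle sized on the *average* broker looks generous...
avg_headroom = DISK_BYTES_SEC - sum(steady_load.values()) / len(steady_load)

# ...but Kafka applies the replication throttle per broker, so it also
# applies to broker-1, whose real headroom is far smaller. Sizing on the
# worst broker keeps its disk from exhausting its burst balance.
worst_headroom = DISK_BYTES_SEC - max(steady_load.values())
safe_throttle = int(worst_headroom * 0.8)  # 20% safety margin (arbitrary)

print(f"average-based throttle: {avg_headroom / 1024**2:.0f} MB/s (unsafe under skew)")
print(f"skew-aware throttle:    {safe_throttle / 1024**2:.0f} MB/s")
```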
When the Kafka brokers were restarted to resolve the issue, some of them received shadow IDs, which caused them to rebalance continuously (a known bug in Kafka 1.0.0). As a result, only about 10% of logs were processed for the duration of the incident.
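The drop to roughly 10% processing showed up as steadily growing consumer-group lag. As a rough illustration, here is a minimal lag check using the kafka-python client; the topic, group, and bootstrap address are hypothetical:

```python
# Sketch: per-partition consumer lag via kafka-python. The topic, group,
# and bootstrap address below are hypothetical.
from kafka import KafkaConsumer, TopicPartition

TOPIC = "logs-ingest"
GROUP = "loggregation-consumers"

consumer = KafkaConsumer(
    bootstrap_servers="kafka-1:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p)
              for p in (consumer.partitions_for_topic(TOPIC) or set())]
latest = consumer.end_offsets(partitions)  # newest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0  # last offset the group committed
    lag = latest[tp] - committed
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print(f"total lag for group '{GROUP}': {total_lag} messages")
consumer.close()
```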
What went well
What went wrong
Timeline
June 3, 2019, 08:25 (UTC) - We started reassigning our Kafka partitions for load balancing
June 3, 2019, 08:26 (UTC) - We identified that the Loggregation service was lagging behind due to multiple Kafka broker timeouts
June 3, 2019, 08:30 (UTC) - Declared an incident due to the cascading failure and began coordinating with the CTO, with updates every few minutes
June 3, 2019, 08:31 (UTC) - Tested the Loggregation service for root causes and determined the problem lay elsewhere
June 3, 2019, 08:35 (UTC) - Inspected Kafka logs and metrics and identified three malfunctioning brokers
June 3, 2019, 09:23 (UTC) - Began a rolling restart of the Kafka brokers in order to avoid data loss. Because the disk burst balance had been exceeded, the restart took longer than usual (see the burst-balance check sketched after the timeline).
June 3, 2019, 10:55 (UTC) - After the first restart we encountered the ‘Cached zkVersion’ Kafka bug, which caused all services to rebalance continuously and thus reduced processing to about 10% (see the log-scan sketch after the timeline)
June 3, 2019, 11:06 (UTC) - Restarted the three brokers that held references to outdated IPs
June 3, 2019, 11:07 (UTC) - Kafka and all services were back online, but 90% of the data received since 12:00 pm was backlogged and still needed to be processed by the Kafka consumers
June 3, 2019, 14:27 (UTC) - Loggregation was back online
June 3, 2019, 14:50 (UTC) - Issue was resolved and all data was up to date
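One follow-up the timeline suggests: the slow restarts at 09:23 were a direct consequence of the exhausted disk burst balance. If the brokers' data disks are gp2 EBS volumes (as the burst-balance symptom suggests), the CloudWatch BurstBalance metric can be checked before a rolling restart begins; here is a sketch with boto3, using hypothetical volume IDs and an arbitrary threshold:

```python
# Sketch: check each broker volume's EBS BurstBalance before a rolling
# restart. Volume IDs and the threshold are hypothetical.
from datetime import datetime, timedelta

import boto3

BROKER_VOLUMES = {
    "broker-1": "vol-0123456789abcdef0",
    "broker-2": "vol-0123456789abcdef1",
    "broker-3": "vol-0123456789abcdef2",
}
MIN_BURST_BALANCE = 60.0  # defer restarts below 60% credits (arbitrary)

cloudwatch = boto3.client("cloudwatch")

def burst_balance(volume_id):
    """Return the most recent BurstBalance datapoint (percent) for a volume."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=30),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0.0

for broker, volume in BROKER_VOLUMES.items():
    balance = burst_balance(volume)
    verdict = "OK" if balance >= MIN_BURST_BALANCE else "TOO LOW - defer restart"
    print(f"{broker} ({volume}): BurstBalance={balance:.1f}% {verdict}")
```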
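Likewise, the ‘Cached zkVersion’ loop hit at 10:55 is visible in the broker logs as repeated "Cached zkVersion [N] not equal to that in zookeeper, skip updating ISR" lines. Here is a rough scan for that symptom; the log path and exact line format are assumptions based on the public bug reports (KAFKA-2729):

```python
# Sketch: flag a broker stuck in the cached-zkVersion loop (KAFKA-2729)
# by counting the telltale line in its server log. Path is hypothetical.
import re
from collections import Counter

LOG_PATH = "/var/log/kafka/server.log"
SYMPTOM = re.compile(r"Cached zkVersion \[?\d+\]? not equal to that in zookeeper")
PARTITION = re.compile(r"Partition \[?([\w.-]+),\s*(\d+)\]?")

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if SYMPTOM.search(line):
            m = PARTITION.search(line)
            key = f"{m.group(1)}-{m.group(2)}" if m else "unknown"
            hits[key] += 1

for partition, count in hits.most_common(10):
    print(f"{partition}: {count} skipped ISR updates")
if hits:
    print("Broker appears stuck in the cached-zkVersion loop; restart likely needed.")
```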