On June 3, 2019, between 08:25 and 14:50 (UTC), some of our servers became unavailable, causing degraded performance in our log processing system. This resulted in a data lag that reached a maximum of 1.5 hours and gradually resolved over a 6.5-hour period.
During a routine Kafka partition reassignment, we encountered a disk I/O lock. The lock was caused by leader skew in our largest topics, which led us to miscalculate the throttling parameter we applied to the reassignment. Soon after the process began, we experienced a spike in data reads that left our service unable to write to Kafka, because our disk had exceeded its burst balance.
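For illustration, here is a minimal sketch (with entirely hypothetical throughput figures) of why a throttle sized for the average broker overshoots when leaders are skewed, and why sizing it against the busiest broker keeps every disk inside its budget:

```python
# Sketch: size a Kafka reassignment throttle so the most leader-skewed
# broker stays inside its disk budget. All figures are hypothetical.

DISK_BYTES_SEC = 250 * 1024 * 1024  # sustained disk throughput per broker (assumed)

# Steady-state traffic per broker, in bytes/sec. broker-1 leads most
# partitions of the largest topics, so it carries far more load.
steady_load = {
    "broker-1": 200 * 1024 * 1024,
    "broker-2": 80 * 1024 * 1024,
    "broker-3": 80 * 1024 * 1024,
}

# A throttle sized on the *average* broker looks generous...
avg_headroom = DISK_BYTES_SEC - sum(steady_load.values()) / len(steady_load)

# ...but Kafka applies the replication throttle per broker, so it also
# applies to broker-1, whose real headroom is far smaller. Sizing on the
# worst broker keeps its disk from exhausting its burst balance.
worst_headroom = DISK_BYTES_SEC - max(steady_load.values())
safe_throttle = int(worst_headroom * 0.8)  # 20% safety margin (arbitrary)

print(f"average-based throttle: {avg_headroom / 1024**2:.0f} MB/s (unsafe under skew)")
print(f"skew-aware throttle:    {safe_throttle / 1024**2:.0f} MB/s")
```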
When the Kafka brokers were restarted to resolve the issue, some of them received shadow IDs, which caused them to rebalance continuously (a known bug in Kafka 1.0.0). As a result, only about 10% of logs were processed for the duration of the incident.
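The drop to roughly 10% processing showed up as steadily growing consumer-group lag. As a rough illustration, here is a minimal lag check using the kafka-python client; the topic, group, and bootstrap address are hypothetical:

```python
# Sketch: per-partition consumer lag via kafka-python. The topic, group,
# and bootstrap address below are hypothetical.
from kafka import KafkaConsumer, TopicPartition

TOPIC = "logs-ingest"
GROUP = "loggregation-consumers"

consumer = KafkaConsumer(
    bootstrap_servers="kafka-1:9092",
    group_id=GROUP,
    enable_auto_commit=False,
)

partitions = [TopicPartition(TOPIC, p)
              for p in (consumer.partitions_for_topic(TOPIC) or set())]
latest = consumer.end_offsets(partitions)  # newest offset per partition

total_lag = 0
for tp in partitions:
    committed = consumer.committed(tp) or 0  # last offset the group committed
    lag = latest[tp] - committed
    total_lag += lag
    print(f"partition {tp.partition}: lag={lag}")

print(f"total lag for group '{GROUP}': {total_lag} messages")
consumer.close()
```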
What went well
What went wrong
Timeline
June 3, 2019, 08:25 (UTC) - We started reassigning our Kafka partitions for load balancing
June 3, 2019, 08:26 (UTC) - We identified that the Loggregation service was lagging behind due to multiple Kafka broker timeouts
June 3, 2019, 08:30 (UTC) - Declared an incident due to the cascading failure and began coordinating with the CTO, with updates every few minutes
June 3, 2019, 08:31 (UTC) - Tested the Loggregation service for root causes and determined the problem lay elsewhere
June 3, 2019, 08:35 (UTC) - Inspected Kafka logs and metrics and identified three malfunctioning brokers
June 3, 2019, 09:23 (UTC) - Began a rolling restart of the Kafka brokers in order to avoid data loss. Because the disk burst balance had been exceeded, the restart took longer than usual (see the burst-balance check sketched after the timeline).
June 3, 2019, 10:55 (UTC) - After the first restart we encountered the ‘Cached zkVersion’ Kafka bug, which caused all services to rebalance continuously and thus reduced processing to about 10% (see the log-scan sketch after the timeline)
June 3, 2019, 11:06 (UTC) - Restarted the three brokers that held references to outdated IPs
June 3, 2019, 11:07 (UTC) - Kafka and all services were back online, but 90% of the data received since 12:00 pm was backlogged and still needed to be processed by the Kafka consumers
June 3, 2019, 14:27 (UTC) - Loggregation was back online
June 3, 2019, 14:50 (UTC) - Issue was resolved and all data was up to date
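One follow-up the timeline suggests: the slow restarts at 09:23 were a direct consequence of the exhausted disk burst balance. If the brokers' data disks are gp2 EBS volumes (as the burst-balance symptom suggests), the CloudWatch BurstBalance metric can be checked before a rolling restart begins; here is a sketch with boto3, using hypothetical volume IDs and an arbitrary threshold:

```python
# Sketch: check each broker volume's EBS BurstBalance before a rolling
# restart. Volume IDs and the threshold are hypothetical.
from datetime import datetime, timedelta

import boto3

BROKER_VOLUMES = {
    "broker-1": "vol-0123456789abcdef0",
    "broker-2": "vol-0123456789abcdef1",
    "broker-3": "vol-0123456789abcdef2",
}
MIN_BURST_BALANCE = 60.0  # defer restarts below 60% credits (arbitrary)

cloudwatch = boto3.client("cloudwatch")

def burst_balance(volume_id):
    """Return the most recent BurstBalance datapoint (percent) for a volume."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=datetime.utcnow() - timedelta(minutes=30),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0.0

for broker, volume in BROKER_VOLUMES.items():
    balance = burst_balance(volume)
    verdict = "OK" if balance >= MIN_BURST_BALANCE else "TOO LOW - defer restart"
    print(f"{broker} ({volume}): BurstBalance={balance:.1f}% {verdict}")
```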
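Likewise, the ‘Cached zkVersion’ loop hit at 10:55 is visible in the broker logs as repeated "Cached zkVersion [N] not equal to that in zookeeper, skip updating ISR" lines. Here is a rough scan for that symptom; the log path and exact line format are assumptions based on the public bug reports (KAFKA-2729):

```python
# Sketch: flag a broker stuck in the cached-zkVersion loop (KAFKA-2729)
# by counting the telltale line in its server log. Path is hypothetical.
import re
from collections import Counter

LOG_PATH = "/var/log/kafka/server.log"
SYMPTOM = re.compile(r"Cached zkVersion \[?\d+\]? not equal to that in zookeeper")
PARTITION = re.compile(r"Partition \[?([\w.-]+),\s*(\d+)\]?")

hits = Counter()
with open(LOG_PATH) as log:
    for line in log:
        if SYMPTOM.search(line):
            m = PARTITION.search(line)
            key = f"{m.group(1)}-{m.group(2)}" if m else "unknown"
            hits[key] += 1

for partition, count in hits.most_common(10):
    print(f"{partition}: {count} skipped ISR updates")
if hits:
    print("Broker appears stuck in the cached-zkVersion loop; restart likely needed.")
```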