On January 15, 2023, Coralogix experienced an incident that prevented some customers from connecting to the Coralogix data-store in our AWS US East (us-east-2) cluster.
As part of the ongoing optimization work that Coralogix carries out on our platform every day, we were optimizing resource usage and storage in the AWS us-east-2 region. This work began at 07:42 GMT+2 and, at 09:38 GMT+2, led to our OpenSearch master nodes losing their connection to the cluster, which put the entire cluster into an unknown state. This prevented some of our customers from connecting to the Coralogix data-store.
Throughout the incident, Alerts, Live Tail, Logs2Metrics, and Archive queries were unaffected and functioned as usual.
Our first priority was to preserve the cluster state, so the Platform team scaled down our master nodes, which enabled us to reconstruct it. Once the state was reconstructed, by 10:19 GMT+2 the cluster was functioning as expected and indexing new data. By 13:21 GMT+2, all shards were initialized with older data and all data was available.
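Recovery of this kind is typically tracked against OpenSearch's `_cluster/health` API, which reports how many shards are still initializing or unassigned. The sketch below is illustrative only, not Coralogix's actual tooling; the field names follow the public `_cluster/health` response, and the sample values are hypothetical:

```python
# Hypothetical sketch: deciding, from an OpenSearch _cluster/health response,
# whether shard recovery is complete. In practice the response would be polled
# with GET /_cluster/health until this returns True.

def is_fully_recovered(health: dict) -> bool:
    """Return True when the cluster reports no shards still moving or missing."""
    return (
        health.get("status") == "green"
        and health.get("initializing_shards", 0) == 0
        and health.get("relocating_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )

# Illustrative snapshots: mid-recovery vs. fully recovered.
during_recovery = {"status": "yellow", "initializing_shards": 42,
                   "relocating_shards": 0, "unassigned_shards": 118}
after_recovery = {"status": "green", "initializing_shards": 0,
                  "relocating_shards": 0, "unassigned_shards": 0}
```

This matches the shape of the timeline below: indexing resumed once enough shards were initialized for real-time data, while the remaining shards (holding older data) continued initializing until every check above passed.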
Customers were unable to retrieve data from the Coralogix data-store between 09:38 GMT+2 and 10:19 GMT+2. Older data (more than 3 days old) remained unavailable until 13:21 GMT+2.
07:42 GMT+2: Infrastructure upgrade work begins.
09:34 GMT+2: Rolling upgrade of OpenSearch master nodes begins.
09:38 GMT+2: The leader node loses connection to the cluster. This is when the incident began.
09:45 GMT+2: Decision made to scale down the master nodes and reconstruct the cluster state. Data-store resources are freed, and the cluster is able to respond to both queries and indexing.
10:01 GMT+2: Shards successfully initialize for real-time data processing.
10:19 GMT+2: Data indexing resumes. This is when the incident was resolved.
11:16 GMT+2: Recent customer data is available without lag.
13:21 GMT+2: Older data is made available, finalizing the issue.
There are some key areas we will improve to ensure that this issue does not happen again.
At Coralogix, we operate with transparency at the heart of everything we do. In this instance, however, we did not anticipate this outcome, and so we did not communicate enough information to our customers. We will improve our communication processes, and any changes of this nature will in future be communicated to customers in advance.
Our infrastructure is extensively covered by alerting; however, this issue did not trigger those alerts immediately. While they did fire eventually, we intend to add new alerts to our data-store that will detect similar behavior sooner and enable us to act faster.
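As a rough illustration of the kind of alert rule this implies, the sketch below classifies a cluster health snapshot into a severity, so that a lost master or red status pages immediately rather than waiting for downstream symptoms. This is an assumption about how such a rule could look, not Coralogix's actual alerting; field names follow the OpenSearch `_cluster/health` response (including its `discovered_master` flag), and the thresholds are illustrative:

```python
# Hypothetical alert rule over an OpenSearch _cluster/health snapshot.
# Severity levels ("page", "warn", "ok") and thresholds are illustrative.

def alert_severity(health: dict) -> str:
    """Map a cluster health snapshot to an alert severity level."""
    if health.get("discovered_master") is False or health.get("status") == "red":
        return "page"  # cluster state at risk: escalate immediately
    if health.get("status") == "yellow" or health.get("unassigned_shards", 0) > 0:
        return "warn"  # degraded but serving: notify on-call channel
    return "ok"
```

A rule shaped like this would have escalated at 09:38 GMT+2, when the master connection was lost, rather than after query failures became visible.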
Our objective is to build a platform that our customers can trust, one that consistently delivers the features that give our customers the tools they need. This is a continuous, iterative process. In this instance, work intended to make our platform as robust and consistent as possible caused a short period of unavailability. We apologize to our customers for the inconvenience caused, and we hope that our remediation will give you confidence that this issue will not happen again.