On January 15, 2023, Coralogix experienced an incident that prevented some customers from connecting to the Coralogix data-store in our AWS US East (us-east-2) cluster.
As part of the ongoing optimization work that Coralogix carries out on our platform every day, we were optimizing resource usage and storage in the AWS us-east-2 region. This work began at 07:42 GMT+2 and, at 09:38 GMT+2, led to our OpenSearch master nodes losing their connection to the cluster, which put the entire cluster into an unknown state. This prevented some of our customers from connecting to the Coralogix data-store.
Throughout the incident, Alerts, Live Tail, Logs2Metrics, and Archive queries were unaffected and functioned as usual.
Our first priority was to preserve the cluster state, so the Platform team scaled down our master nodes, which enabled us to reconstruct it. Once the state was reconstructed, by 10:19 GMT+2 the cluster was functioning as expected and indexing new data. By 13:21 GMT+2, all shards were initialized with older data and all data was available.
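Recovery of this kind is typically tracked against OpenSearch's `_cluster/health` API, which reports how many shards are still initializing or unassigned. The sketch below is illustrative only, not Coralogix's actual tooling; the field names follow the public `_cluster/health` response, and the sample values are hypothetical:

```python
# Hypothetical sketch: deciding, from an OpenSearch _cluster/health response,
# whether shard recovery is complete. In practice the response would be polled
# with GET /_cluster/health until this returns True.

def is_fully_recovered(health: dict) -> bool:
    """Return True when the cluster reports no shards still moving or missing."""
    return (
        health.get("status") == "green"
        and health.get("initializing_shards", 0) == 0
        and health.get("relocating_shards", 0) == 0
        and health.get("unassigned_shards", 0) == 0
    )

# Illustrative snapshots: mid-recovery vs. fully recovered.
during_recovery = {"status": "yellow", "initializing_shards": 42,
                   "relocating_shards": 0, "unassigned_shards": 118}
after_recovery = {"status": "green", "initializing_shards": 0,
                  "relocating_shards": 0, "unassigned_shards": 0}
```

This matches the shape of the timeline below: indexing resumed once enough shards were initialized for real-time data, while the remaining shards (holding older data) continued initializing until every check above passed.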
Customers were unable to retrieve data from the Coralogix data-store between 09:38 GMT+2 and 10:19 GMT+2. Older data (more than 3 days old) remained unavailable until 13:21 GMT+2.
07:42 GMT+2: Infrastructure upgrade work begins.
09:34 GMT+2: Rolling upgrade of OpenSearch master nodes begins.
09:38 GMT+2: The leader node loses connection to the cluster. This is when the incident began.
09:45 GMT+2: Decision made to scale down the master nodes and reconstruct the cluster state. Data-store resources are freed, and the cluster is able to respond to both queries and indexing.
10:01 GMT+2: Shards successfully initialize for real-time data processing.
10:19 GMT+2: Data indexing resumes. This is when the incident was resolved.
11:16 GMT+2: Recent customer data is available without lag.
13:21 GMT+2: Older data is made available, finalizing the issue.
There are some key areas we will improve to ensure that this issue does not happen again.
At Coralogix, we operate with transparency at the heart of everything we do. In this instance, however, we did not anticipate this outcome, and so we did not communicate enough information to our customers. We will improve our communication processes, and any changes of this nature will in future be communicated to customers in advance.
Our infrastructure is extensively covered by alerting; however, this issue did not trigger those alerts immediately. While they did fire eventually, we intend to add new alerts to our data-store that will detect similar behavior sooner and enable us to act faster.
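As a rough illustration of the kind of alert rule this implies, the sketch below classifies a cluster health snapshot into a severity, so that a lost master or red status pages immediately rather than waiting for downstream symptoms. This is an assumption about how such a rule could look, not Coralogix's actual alerting; field names follow the OpenSearch `_cluster/health` response (including its `discovered_master` flag), and the thresholds are illustrative:

```python
# Hypothetical alert rule over an OpenSearch _cluster/health snapshot.
# Severity levels ("page", "warn", "ok") and thresholds are illustrative.

def alert_severity(health: dict) -> str:
    """Map a cluster health snapshot to an alert severity level."""
    if health.get("discovered_master") is False or health.get("status") == "red":
        return "page"  # cluster state at risk: escalate immediately
    if health.get("status") == "yellow" or health.get("unassigned_shards", 0) > 0:
        return "warn"  # degraded but serving: notify on-call channel
    return "ok"
```

A rule shaped like this would have escalated at 09:38 GMT+2, when the master connection was lost, rather than after query failures became visible.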
Our objective is to build a platform that our customers can trust, one that consistently delivers the features that give our customers the tools they need. This is a continuous, iterative process. In this instance, work intended to make our platform as robust and consistent as possible caused a short period of unavailability. We apologize to our customers for the inconvenience caused, and we hope that our remediation will give you confidence that this issue will not happen again.