Signals Delayed On Delivery
Incident Report for cloudsurgard
Postmortem

Issue Summary

On November 28, 2022 at 07:35 CET, a delay began to occur in the delivery of a percentage of the signals to the alarm centers.

The delay did not trigger customer alerts because signals did not accumulate in the queues of the alarm centers.

No signals were lost, but a percentage of them were delivered late, and a smaller percentage were delayed by more than 30 minutes.

Timeline

2022-11-28@08:25 EST

The error is located. A metrics service is taking longer than usual, causing the larger message batches (8, 9, and 10 messages) to time out and their messages to be requeued for processing.

The requeued batches cause further timeouts as the large-batch processing queue fills up, so the queue grows continuously.
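
The feedback loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual service: the function name, the batch sizes, the per-message latency, and the timeout are all hypothetical, and the model assumes one batch arrives and one batch is attempted per tick.

```python
from collections import deque

def queue_depths(arrivals, per_message_seconds, timeout_seconds):
    """Toy model: one batch arrives and one batch is attempted per tick.

    A batch whose processing time exceeds the timeout is requeued
    instead of completing. Returns the queue depth after each tick.
    """
    queue = deque()
    depths = []
    for size in arrivals:
        queue.append(size)       # a new batch arrives
        batch = queue.popleft()  # the worker attempts the oldest batch
        if batch * per_message_seconds > timeout_seconds:
            queue.append(batch)  # timed out: requeued, not completed
        depths.append(len(queue))
    return depths

# Hypothetical batch sizes and timings, chosen only for illustration.
arrivals = [10, 2, 9, 3, 8, 10]
healthy = queue_depths(arrivals, per_message_seconds=0.5, timeout_seconds=6.0)
slow = queue_depths(arrivals, per_message_seconds=1.0, timeout_seconds=6.0)

print(healthy)  # queue stays empty when every batch finishes in time
print(slow)     # queue grows once the large batches start timing out
```

In this model the small batches still complete while the large ones cycle through the queue indefinitely, which matches the observation that only a percentage of signals was delayed.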

2022-11-28@08:55 EST

After the diagnosis is confirmed, a configuration patch is prepared that reduces the message batch limit to 1 and extends the processing timeout to a reasonable value.
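
Why this patch helps can be sketched as follows. The batch limit of 1 comes from this report; everything else here (the `ProcessorConfig` name, the previous limit of 10, both timeout values, and the per-message latency) is a hypothetical placeholder, since the real values are not stated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcessorConfig:
    batch_limit: int        # maximum number of messages per batch
    timeout_seconds: float  # time allowed to process one batch

def split_into_batches(messages, config):
    """Split messages into consecutive batches of at most batch_limit items."""
    step = config.batch_limit
    return [messages[i:i + step] for i in range(0, len(messages), step)]

def count_timeouts(messages, per_message_seconds, config):
    """Count batches whose processing time would exceed the timeout."""
    return sum(
        1
        for batch in split_into_batches(messages, config)
        if len(batch) * per_message_seconds > config.timeout_seconds
    )

# Hypothetical "before" values; only batch_limit=1 is taken from the report.
before = ProcessorConfig(batch_limit=10, timeout_seconds=6.0)
after = ProcessorConfig(batch_limit=1, timeout_seconds=15.0)

messages = list(range(25))
slow_latency = 1.0  # assumed seconds per message while the metrics service is slow

print(count_timeouts(messages, slow_latency, before))
print(count_timeouts(messages, slow_latency, after))
```

With single-message batches, the processing time of a batch no longer scales with its size, so a slow dependency delays each message individually instead of pushing whole batches past the timeout.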

2022-11-28@09:10 EST

The service is reconfigured with the new parameters and the queued signals are reprocessed correctly, reducing the number of unprocessed messages. The processing queue no longer grows as it did before.

2022-11-28@09:50 EST

All delayed signals have already been successfully reprocessed.

Root Cause

A non-essential service (the metrics service) took longer than expected, which caused large message batches to time out and be requeued.

Resolution and recovery

The system configuration was changed to reduce the message batch limit to 1 and extend the processing timeout.

Corrective and Preventative Measures

Several tickets have been created to mitigate the problem if any non-essential service is delayed or unavailable.
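
One common way to limit the impact of a slow non-essential dependency is to give it a bounded time budget so that it cannot hold up signal delivery. The sketch below is an illustration of that general pattern and not a description of the tickets themselves. All names (`deliver_signal`, `record_metrics`, `deliver`) and the budget value are hypothetical, and it assumes delivery can happen independently of the metrics call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool for non-essential calls (hypothetical size).
_executor = ThreadPoolExecutor(max_workers=4)

def deliver_signal(signal, record_metrics, deliver, metrics_budget_seconds=0.2):
    """Deliver the signal, then give the non-essential call a bounded budget."""
    deliver(signal)  # the essential step is never blocked by metrics
    future = _executor.submit(record_metrics, signal)
    try:
        future.result(timeout=metrics_budget_seconds)
        return "delivered, metrics recorded"
    except FutureTimeout:
        # The metrics call keeps running in the background; we stop waiting.
        return "delivered, metrics skipped"
    except Exception:
        # A failing non-essential service must not fail the delivery.
        return "delivered, metrics failed"

# Usage with stand-in callables.
delivered = []

def fast_metrics(signal):
    return None

def slow_metrics(signal):
    time.sleep(0.6)  # simulates a degraded metrics service

def broken_metrics(signal):
    raise RuntimeError("metrics service unavailable")

fast_result = deliver_signal("s1", fast_metrics, delivered.append)
slow_result = deliver_signal("s2", slow_metrics, delivered.append)
broken_result = deliver_signal("s3", broken_metrics, delivered.append)
```

In all three cases the signal is delivered; only the outcome of the metrics step differs.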

In addition, multiple metrics and alerts have been created to identify these issues in advance.
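
As an illustration of the kind of check such alerts could perform, the sketch below flags a processing queue that keeps growing and a message that has waited too long. It is an assumption about what a useful alert might look like, not the alerts that were actually created. The function names and the window size are hypothetical, and the 30-minute threshold is borrowed from the delay figure mentioned in the summary.

```python
def queue_is_growing(depth_samples, window=5):
    """True when the last `window` depth samples are strictly increasing."""
    recent = depth_samples[-window:]
    if len(recent) < window:
        return False  # not enough data to decide
    return all(a < b for a, b in zip(recent, recent[1:]))

def oldest_message_too_old(oldest_enqueued_at, now, max_age_seconds=30 * 60):
    """True when the oldest queued message has waited longer than the limit.

    Both timestamps are seconds since the same reference point.
    """
    return (now - oldest_enqueued_at) > max_age_seconds

# Hypothetical queue-depth samples taken at regular intervals.
print(queue_is_growing([3, 4, 6, 9, 15]))
print(queue_is_growing([3, 4, 4, 2, 1]))
print(oldest_message_too_old(oldest_enqueued_at=0, now=45 * 60))
```

Checks like these watch the internal processing queue directly, which matters here because the queues of the alarm centers did not accumulate signals and therefore raised no alert.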

A protocol has been implemented to create an incident on the status page as soon as a problem is detected.

Posted Nov 28, 2022 - 21:00 CET

Resolved
Incident resolved. Tickets and metrics have been created to avoid the same issue in the future.
Posted Nov 28, 2022 - 08:30 CET