On November 28, 2022 at 07:35 CET, the delivery of a percentage of the signals to the alarm centers began to be delayed.
The delay does not trigger customer alerts because signals are not accumulating in the queues of the alarm centers.
No signals have been lost, but a percentage of them have been delivered late, and a smaller percentage have been delayed by more than 30 minutes.
The error has been located. A metrics service is taking longer than usual, causing the larger message batches (8, 9, and 10 messages) to time out and their messages to be requeued for processing.
The requeued messages fill up the processing queue, which produces more large batches and therefore more timeouts, so the queue grows continuously.
After the diagnosis is confirmed, the configuration is patched to reduce the message batch limit to 1 and to extend the processing timeout to a reasonable value.
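The relationship between batch size, the slow service, and the timeout can be illustrated with a minimal sketch. It assumes that each message in a batch adds one call to the slow service, so processing time grows linearly with batch size. The names `BatchConfig` and `batch_times_out`, and all numeric values (the previous limit of 10, both timeouts, and the 4-second latency), are hypothetical and chosen only to reproduce the pattern described above; they are not the real system parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    batch_limit: int   # maximum number of messages per batch
    timeout_s: float   # time allowed to process one batch, in seconds


def batch_times_out(batch_size: int, per_message_s: float, cfg: BatchConfig) -> bool:
    """Return True if a batch would exceed the processing timeout.

    Assumption: processing time is batch size times per-message latency.
    """
    size = min(batch_size, cfg.batch_limit)  # the limit caps the batch size
    return size * per_message_s > cfg.timeout_s


# Hypothetical values, for illustration only.
before = BatchConfig(batch_limit=10, timeout_s=30.0)
after = BatchConfig(batch_limit=1, timeout_s=60.0)
slow_call_s = 4.0  # assumed latency of the degraded metrics service

failing_before = [n for n in range(1, 11) if batch_times_out(n, slow_call_s, before)]
failing_after = [n for n in range(1, 11) if batch_times_out(n, slow_call_s, after)]

print(failing_before)  # [8, 9, 10]
print(failing_after)   # []
```

Under these assumed numbers only batches of 8 or more messages exceed the timeout, and with a batch limit of 1 and a longer timeout no batch does.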
The service is reconfigured with the new parameters and the queued signals are reprocessed correctly, lowering the number of unprocessed messages. The processing queue no longer grows as it did before.
All delayed signals have already been successfully reprocessed.
A non-essential service took more time than expected
Change to the system configuration
Many tickets have been added to mitigate the problem if any non-essential service is delayed or missing.
In addition, multiple metrics and alerts have been created to identify these issues in advance.
A protocol has been implemented to create an incident on the status page as soon as a problem is detected.
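As an illustration of the first action above (tolerating a delayed or missing non-essential service), one possible approach is to call such a service on a best-effort basis with its own short timeout, so that signal delivery never waits on it. This is only a sketch under that assumption, not the change actually made; `call_best_effort`, `deliver_signal`, and the timeout value are hypothetical names and numbers.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_best_effort(fn, *args, timeout_s=0.2):
    """Run a non-essential call; return True only if it finished in time."""
    future = _pool.submit(fn, *args)
    try:
        future.result(timeout=timeout_s)
        return True
    except FutureTimeout:
        return False  # stop waiting; the call may still finish in the background
    except Exception:
        return False  # a failing non-essential call must not block delivery


def deliver_signal(signal, send_to_alarm_center, record_metrics):
    """Deliver first (essential), then record metrics (non-essential)."""
    send_to_alarm_center(signal)
    return call_best_effort(record_metrics, signal)


# Usage: a metrics call that is too slow does not prevent delivery.
delivered = []


def slow_metrics(signal):
    time.sleep(0.5)  # simulates the degraded service


slow_ok = deliver_signal("signal-1", delivered.append, slow_metrics)
fast_ok = deliver_signal("signal-2", delivered.append, lambda s: None)
```

In this sketch a timed-out call is abandoned rather than cancelled, so a real implementation would also need to bound how many slow calls can pile up in the background.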