Date of incident: Monday, September 14, 2020
Time of incident: 3:30pm AEST on September 14 to 12:00pm AEST on September 15
Incident Report: September 14, 2020
During the above timeframe, there was an issue with the ClickSend SMS gateway which caused message delays and some rejections. At 3:30pm AEST on September 14 we deployed a change to our sending infrastructure which caused an issue with how the sender workers started the Tomcat service, and this impacted our message routing. While we have a number of sender workers for redundancy and failover, our deployment of these senders did not operate as expected, and we had to stabilise the senders manually, in order of priority.
• 3:30pm - 5:40pm AEST on September 14: The majority of messages were queued during this time, with some message rejections.
• 3:30pm - 7:24pm AEST on September 14: All messages were delayed in processing.
• 3:30pm AEST on September 14 - 12:00pm AEST on September 15: A subset of 147K messages across 5K customers was delayed.
Our technology and operations teams began working to resolve the issue immediately. They restored all outbound message sending from 5:40pm AEST on September 14. Customers would have experienced message delays until 6:40pm on Monday, September 14. A smaller subset of customers would have experienced delays in message sending until we replayed the affected messages at 12:00pm AEST on September 15. Each sender worker then received a fresh instance to remove the patch applied on September 14.
We are confident this incident will not occur again. We have the following enhancements to the controls and processes around deployments. In the unlikely event that a similar incident does occur, these enhanced controls will also help us to recover significantly faster.
• The deployment pipeline will be split into priority sender tiers, with health stack sign-off on each, removing the dependency on QA Automation.
• Monitoring of Discard events, which will allow us to track and replay messages significantly faster.
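To illustrate the idea of a tiered rollout with a health sign-off on each tier, here is a minimal sketch in Python. It is not ClickSend's actual pipeline: the function name `rollout_by_tier`, the `deploy` and `is_healthy` callbacks, and the tier and worker names are all hypothetical, and the rollout order is simply assumed to be the order in which the tiers are listed.

```python
from typing import Callable, Dict, List


def rollout_by_tier(
    tiers: Dict[str, List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy one tier at a time and stop at the first failed health check.

    Returns the names of the tiers that were deployed and signed off.
    All names here are hypothetical placeholders for illustration.
    """
    signed_off: List[str] = []
    for tier_name, workers in tiers.items():
        # Deploy the change to every sender worker in this tier.
        for worker in workers:
            deploy(worker)
        # Health sign-off gate: every worker in the tier must report healthy.
        if not all(is_healthy(worker) for worker in workers):
            break  # Halt here; later tiers keep running the previous version.
        signed_off.append(tier_name)
    return signed_off
```

With a gate like this, a change that breaks worker start-up would be contained to the first tier that fails its health check, and the remaining tiers would keep sending on the previous version.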