On April 16th, 18th, and 19th, 2024, our US customers experienced file-processing delays. During this time, our engineers focused on fixing the underlying issue and on getting new work moving as soon as possible. Whenever new work was processing normally, they turned to the backlog of impacted jobs. Multiple issues were discovered and addressed along the way.
On April 15th, 2024, a routine security upgrade for our file processing service was deployed. This upgrade unexpectedly caused jobs to be stored longer than normal and caused some failed work to retry continuously. These issues reduced the processing capacity of the service, which limited the amount of new work we could process at a time and caused work to back up. Multiple temporary fixes were deployed to restore the service's capacity. These temporary fixes allowed new work to process and the backlog to be addressed, but they did not resolve the underlying issue, so the delays recurred later in the week.
Once the underlying issues were identified, the temporary fixes were removed and a more permanent fix was implemented.
With the fix in place, we are now improving our alerts so that we catch these types of issues faster.