
29 May 2019

RESOLVED: SaaS disruption, several environments unavailable

This problem was marked as resolved on May 29, 2019 at 2:01:03 PM CEST.

Original post
May 29, 2019 at 8:55:58 AM CEST
We are currently experiencing problems at one of our hosting locations. As a result, your TOPdesk environment may not be available.
We are aware of the problem and working on a solution.

Our apologies for the inconvenience. We aim to update this status blog at least every 30 minutes until the issue has been resolved.

Root Cause Analysis 

Timeline

On May 29th, around 09:00AM CEST, the file cluster used for hosting temporary files for TOPdesk environments in the NL3 datacenter became unavailable. This unavailability caused TOPdesk environments to crash when specific actions were executed. The unavailable environments were detected within minutes and TOPdesk started an investigation.

At 09:23AM CEST the file cluster was back online. All affected environments were restarted. TOPdesk environments that had not restarted could still crash, as there were references to unavailable files in the TOPdesk memory. 

Around 11:00AM CEST the status page was updated to inform customers that restarts for the remaining customers in the NL3 datacenter would be scheduled between 11:59AM CEST and 13:00 (1:00PM) CEST to prevent unplanned restarts during the rest of the day. The restarts were executed as scheduled, but took somewhat longer to complete due to the number of simultaneous requests.

 

Follow-up 

A change to compartmentalize the storage for temporary files had already been started, but was not yet ready for production. We're coordinating with all relevant teams to ensure this is completed with high priority.

We are also investigating why a file storage cluster designed for high availability became unavailable. Options to notify users working in a TOPdesk environment of an upcoming restart are being investigated.


Update(s):
May 29, 2019 at 2:00:54 PM CEST
All restart jobs were executed around 13:15. Due to the number of restart jobs, we saw some performance issues on several application servers. Application server performance is now back to normal levels. A root cause analysis has been scheduled. RCA details will be posted on the status page next week. If you still experience any (performance) issues, please contact TOPdesk Support.

May 29, 2019 at 1:04:18 PM CEST
Some TOPdesk environments are still restarting. The restarts are taking a bit longer than expected, as the servers are busier than normal due to the number of environments that needed to be restarted in a short time frame.

May 29, 2019 at 11:08:04 AM CEST
There are still a lot of environments restarting as a result of the disruption earlier today. We are scheduling restarts for all environments in the NL3 datacenter between 12:00 CEST and 13:00 CEST to prevent unplanned restarts during the rest of the day.

Environments that have already been restarted this morning do not need an additional restart. TOPdesk Support will cancel these additional restarts to ensure that environments that are no longer prone to an unexpected restart remain online.

May 29, 2019 at 9:54:23 AM CEST
This morning there was a short disruption in the file storage system used for temporary files for TOPdesk environments. The temporary unavailability of the file storage system in combination with specific settings and actions in TOPdesk can cause TOPdesk environments to crash, even at a later time.

When a TOPdesk environment crashes, TOPdesk will automatically restart and recover after approximately 10 minutes. After this restart the issue does not recur.

TOPdesk Support is monitoring for crashing environments and will restart all affected environments that do not recover quickly. The TOPdesk environments that were affected during the file storage disruption did not recover automatically, and have been manually restarted. These environments are now all back online.

To prevent future TOPdesk crashes, all TOPdesk environments using the file storage that malfunctioned this morning will need to be restarted. We recommend that customers in the NL3 datacenter who have not yet experienced any issues schedule a restart for their environment via this form on our Extranet.

May 29, 2019 at 9:23:01 AM CEST
TOPdesk environments were unreachable due to a problem with the temporary file storage. The root cause of this issue has been resolved, and all affected TOPdesk environments are being restarted to resolve the issue.

May 29, 2019 at 9:11:20 AM CEST
Several TOPdesk SaaS environments are unavailable both inside and outside the TOPdesk SaaS network. We are running several tests to determine the root cause of this issue.