
9 April 2019

RESOLVED: SaaS environments unreachable: timeout

This problem was marked as resolved on April 9, 2019 at 2:22:07 PM CEST.

Original post
April 9, 2019 at 8:29:55 AM CEST
We are currently experiencing problems at one of our hosting locations. As a result, your TOPdesk environment may not be available.
We are aware of the problem and working on a solution.

Our apologies for the inconvenience. We aim to update this status blog at least every 30 minutes until the issue has been resolved.

April 12, 2019 at 1:08:57 PM CEST

Root Cause Analysis 

Timeline

On April 9th at 08:00AM CEST, the file cluster used for hosting temporary files for TOPdesk environments in the NL3 datacenter became unavailable. This caused TOPdesk environments to crash when specific actions were executed. The unavailable environments were detected within minutes, and TOPdesk started an investigation.

At 08:30AM CEST the file cluster was back online. TOPdesk environments started before that time could still crash, because references to unavailable files remained in TOPdesk's memory. TOPdesk operators restarted all unavailable environments. Environments that encountered issues after 08:30 restarted automatically.

At 09:00AM CEST all previously down environments were back online.

Around 11:00AM CEST the root cause of the issue was confirmed. Restarts were scheduled for the next available maintenance window to permanently resolve any remaining issues for TOPdesk environments.  



A change to compartmentalize the storage for temporary files has been started. We are also investigating why a file storage cluster designed for high availability became unavailable.

Several actions have been scheduled to reduce the time before information regarding a disruption is available to customers, including improving internal communication procedures and further automating the process to update the status page.  


TOPdesk SaaS reliability 

We are aware that there have been too many disruptions to our SaaS services recently. Even though the root cause of each disruption has been investigated and mitigated, improving the reliability of our SaaS services is our highest priority.

April 10, 2019 at 9:01:49 AM CEST
All TOPdesk environments using the file storage that malfunctioned yesterday were restarted last night. We no longer notice any residual issues. Please contact TOPdesk Support if your TOPdesk environment is still showing any errors. TOPdesk will post a root cause analysis on the status page within a week.

April 9, 2019 at 2:21:24 PM CEST
This morning there was a short disruption in the file storage system used for temporary files for TOPdesk environments. The temporary unavailability of the file storage system, in combination with specific settings and actions in TOPdesk, can cause TOPdesk environments to crash, even at a later time. When a TOPdesk environment crashes, TOPdesk will automatically restart and recover after approximately 10 minutes. After this restart the issue does not recur. The TOPdesk environments that were affected during the file storage disruption did not recover automatically and were manually restarted at that time. To prevent future TOPdesk crashes, all TOPdesk environments using the file storage that malfunctioned this morning will be restarted during the next maintenance window. TOPdesk is still investigating the root cause of the storage system disruption. A root cause analysis will be posted on our status blog within 5 working days.

April 9, 2019 at 9:35:39 AM CEST
The TOPdesk environments are back online. We are still investigating the root cause.

April 9, 2019 at 9:08:27 AM CEST
Nearly all TOPdesk environments affected by the disruption are back online. We are starting the last few environments.

April 9, 2019 at 8:52:16 AM CEST
Engineers are investigating a disruption to our storage infrastructure that is affecting TOPdesk availability. We will provide an update in 15 minutes.