Incident Report - 29 March 2018

Incident from: 28 March 2018, 7:52 PM CEST (UTC+02:00)

to: 29 March 2018, 9:38 AM CEST (UTC+02:00)

Dear Customer,
 
Despite all our efforts to prevent such events, a major incident occurred yesterday evening, 28 March 2018 at 7:52 PM CEST (UTC+02:00), and could not be resolved until this morning, 29 March 2018, at 9:38 AM CEST (UTC+02:00).
 
We want to revisit this problem in full transparency: to explain it to you in detail and to tell you what measures we have undertaken so that such a problem never happens again. 
 
Please know that we fully understand how difficult it must have been to carry out some of your operations, and we apologize for all the inconvenience this has caused you and your teams. 
 
What happened: 

  • Microsoft Azure "recycled", as it is entitled to do, some servers, including all of the servers that handle the routing of our services. 
  • During this operation, which took place yesterday, the server configuration was modified in an unplanned way.
  • This reconfiguration made the Java service running on our servers unstable. 
  • As a consequence, none of the redundant servers carrying the routing service was able to restart properly and handle requests.
  • In fact, all of the servers actually running our cloud applications were up and running, but the routing service that provides access to them was not.
  • Unfortunately, the problem was made worse because the first major alarm raised at that time by our monitoring system for our cloud infrastructure fired on a secondary service rather than on the router. 
  • This secondary alarm masked, for far too long, the main alarm that should have been processed.
  • Even after the main problem was addressed, the existing procedures for restarting and redeploying services did not work.
  • Finally, we had to recreate a completely new routing service, which was finished this morning at 9:38 AM CEST (UTC+02:00).

After analyzing the problem encountered, and to be sure that this kind of error will not happen again, we have decided to:

  • Rebuild the routing service configuration to ensure that the Windows service deployed by Azure can no longer affect the Java service required for routing.
  • Reconfigure our alarm system so that a major alarm on a subordinate service can no longer prevent us from receiving a major alarm on our routing service.
  • Document a procedure for rebuilding the routing service from scratch, in case another error leads to the same type of unavailability, so that it can be executed by the level-one production team in less than 15 minutes.  

We are confident that these measures will ensure that this kind of incident is not repeated and that, should one occur, it is handled correctly. In addition, to see whether we can do even better than these first measures, we have also decided to launch several studies into adding other forms of redundancy and protection to this routing service.
 
We are truly sorry for the impact that this failure, on such a simple and basic system sitting in front of all our applications, may have had on you. We can assure you that all our teams (solutions, support, development, quality and production), beyond obviously ensuring that errors of the same type cannot happen again, are focused on doing everything possible so that such events do not occur.

Kind Regards,

Patrick CHAUVEL
Chief Customer Officer