[] Service issue and permissions issues yesterday.

Hello everyone,

  We just wanted to make sure that everyone was up to date on what happened yesterday.

Starting at around 2pm EST, one of our network switches became unstable and eventually crashed.  As a result, certain backends were no longer accessible, which caused availability issues for some of our front-end services.

As this progressed, we also determined that our permissions sync tooling was handling the situation more aggressively than expected: it began removing permissions in an effort to fail safe (to the lowest possible permissions in this case).

We were unable to recover the switch remotely, so a member of the IT team was dispatched to the IDC to attempt a manual recovery.  The permissions sync tooling was also halted while our internal team began creating an emergency patch to better handle this situation and to prevent further changes to permissions.
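
For readers who maintain similar tooling, the failure mode described above can be illustrated with a small sketch. This is not our actual sync tooling or the emergency patch; every name here (BackendUnavailable, SyncResult, sync_permissions, fetch_expected) is hypothetical. It only shows one common way to keep "the backend could not be reached" separate from "the backend says this user has no permissions", so that an outage does not turn into removals.

```python
from dataclasses import dataclass
from typing import Callable, Set


class BackendUnavailable(Exception):
    """Hypothetical error: the source of truth could not be reached."""


@dataclass
class SyncResult:
    applied: bool        # False when the sync pass was skipped
    added: Set[str]      # permissions that would be granted
    removed: Set[str]    # permissions that would be revoked
    reason: str


def sync_permissions(
    user: str,
    current: Set[str],
    fetch_expected: Callable[[str], Set[str]],
) -> SyncResult:
    """Compute permission changes for one user (illustrative only)."""
    try:
        expected = fetch_expected(user)
    except BackendUnavailable as exc:
        # No answer is not the same as "no permissions":
        # leave the current state untouched and report the skip.
        return SyncResult(False, set(), set(), f"skipped: {exc}")

    # The backend answered, so its view is trusted in both directions.
    return SyncResult(True, expected - current, current - expected, "ok")
```

In this sketch a lookup that raises BackendUnavailable yields an empty change set, while a successful lookup that returns fewer permissions still results in removals. Whether to skip, retry, or alert on an unreachable backend is a policy decision this example does not try to settle.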

The switch was successfully restarted at about 3:30pm, and once it was operational, services began to return to normal, although a few later required restarts to clear cached connection failures.  This also allowed the permissions sync tooling to begin restoring the removed permissions.

Per our standard practice, we posted updates during this incident in an attempt to keep the community up to date.  A couple of notes were also sent to the cross-projects mailing list to reach more people in the community.

At this time all services should be back to normal.  The root cause appears to have been a packet storm that overloaded an older switch while another networking component was being configured.

