
Re: [cross-project-issues-dev] download.eclipse.org unavailable

Thank you both for your perspectives. It is understood that we can do a better job of communicating, both during and after outages. Such events are convenient for absolutely no one, and we try very hard to avoid them in the first place. We feel your pain.

I do feel the need to remind folks of the Service Level Agreements for the numerous services provided by Eclipse Foundation IT:

https://wiki.eclipse.org/IT_SLA

CBI is a Tier II - Best Effort service, and considering the Foundation's "shut down" status during that week, I think we fared pretty well. Strategic members of the EF can alert IT staff directly, outside of business hours, using SMS text, to expedite resolution of Tier II & Tier III service outages. If CBI, or other parts of the Eclipse Foundation's Infra, are instrumental to your business, please consider Strategic Membership, as it has many benefits.

https://www.eclipse.org/membership/


Denis





On 2021-08-13 9:38 a.m., Christoph Läubrich wrote:
Thanks Ed for the detailed timeline. I can also confirm that, from a simple committer's point of view, the outage was not over on Aug 2 (maybe what I consider 'core services' differs from the infra point of view) but lasted until 4 Aug, when I could continue the work on my issues.

So for me the summary "The outage was extensive, and for core services, lasted for approximately 18 hours. Non-core services were degraded for an additional 12 hours." does not feel quite right. As said before, I can't prove that; it's just that I was only able to resume my work on Aug 4 (120hrs later!), once the tycho-ci server had been restarted ...

So for me it seems that a check "are all build servers running and have executors" is missing from the status page.
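
A minimal sketch of what such a check could look like, under stated assumptions: the thread does not say which CI software the build servers run, so this assumes Jenkins-style JSON as returned by `/computer/api/json` (a `computer` list whose entries carry `displayName`, `offline` and `numExecutors`). The function names are hypothetical, and fetching the JSON from a real server is deliberately left out.

```python
from typing import Any, Dict, List


def find_unhealthy_nodes(computer_api: Dict[str, Any]) -> List[str]:
    """Return the names of build nodes that are offline or have no executors.

    `computer_api` is assumed to be the already-parsed JSON document of a
    Jenkins-style `/computer/api/json` response (an assumption, not
    something stated in this thread).
    """
    unhealthy = []
    for node in computer_api.get("computer", []):
        name = node.get("displayName", "<unnamed>")
        # A node cannot run builds if it is offline or has zero executors.
        if node.get("offline", False) or node.get("numExecutors", 0) < 1:
            unhealthy.append(name)
    return unhealthy


def build_farm_ok(computer_api: Dict[str, Any]) -> bool:
    """True only if at least one node is listed and none is unhealthy."""
    nodes = computer_api.get("computer", [])
    return bool(nodes) and not find_unhealthy_nodes(computer_api)
```

A status page could run something like this per build server and flag any server where `build_farm_ok` is false. Note that some setups intentionally give the controller node zero executors, so a real check would probably need an allow-list for such nodes.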

On 13.08.21 at 15:22, Ed Willink wrote:
Hi

Thank you all for hitting problems quite quickly once you were engaged. Perhaps this 'bystander's' perspective may help to understand the need to communicate better.

I first became aware of the problem after receiving notification a little after 2:42 EDT 1-Aug that a weekly OCL rebuild had failed. Investigation of the log pointed a finger at the GIT repo and eclipsestatus.io indicated that a major outage was in progress with an 'investigating' tweet. Clearly someone was on the case and so the bystander effect took over and I didn't raise any reports or emails to distract.

'investigating' status advanced to 'fix-in-progress' after an hour.

But then nothing for a further 5 hours, at which point we got 'it will take 13 hours'. On Twitter someone asked when the 13 hours started; one might have hoped that it would be from the 'fix-in-progress' time. This tweet and an 'ETA?' tweet were never answered.

17 hours later we got 'most websites' back, which might be true but with important services down, it was misleading. It took a further perhaps 4 hours for https://download.eclipse.org/tools/orbit/downloads/latest-I to return, and 50 hours before projects-storage.eclipse.org was back and another couple of hours to get /shared/common/apache-ant-latest/bin/ant back.

IMHO the outage lasted until at least the restoration of projects-storage.eclipse.org at Aug 4 8:50 and so one of the issues to be addressed by the postmortem must be why the status page still reports no incidents or outage on the whole of the 3rd Aug when, for committers at least, there was no useable service all day.

I must thank the team again for their hard work with a very difficult problem, but must also stress that the communication was very poor. So much so that at 3:07 EDT on 4th Aug I sent a private email to Ed Merks speculating that:

    The total silence from the team is now way beyond incompetence/discourtesy/embarrassment; there must be another reason.

    Paranoia sets in.

    Is some government / hostile agency intervening to prevent communication?

    Are the team voluntarily maintaining silence to contain a security issue?

Please ensure that whenever possible the status updates are much more informative.

     Regards

         Ed Willink


On 09/08/2021 21:45, Denis Roy wrote:

I very much appreciate the sympathy and the support. In the end, the Infra team can do better than this.  We'll lick our wounds and go back to the drawing board to make sure we don't repeat the same mistakes twice.

Postmortem is written, pending review with my team.



Denis




_______________________________________________
cross-project-issues-dev mailing list
cross-project-issues-dev@xxxxxxxxxxx
To unsubscribe from this list, visit https://www.eclipse.org/mailman/listinfo/cross-project-issues-dev

--

Denis Roy

Director, IT Services | Eclipse Foundation

Eclipse Foundation: The Community for Open Innovation and Collaboration

Twitter: @droy_eclipse

