Re: [cross-project-issues-dev] download.eclipse.org unavailable
Thank you both for your perspectives. It is understood that we
can do a better job of communicating, both during and after
outages. Such events are convenient for absolutely no one, and we
try very hard to avoid them in the first place. We feel your pain.
I do feel the need to remind folks of the Service Level
Agreements for the numerous services provided by Eclipse
Foundation IT:
https://wiki.eclipse.org/IT_SLA
CBI is a Tier II - Best Effort service, and considering the
Foundation's "shut down" status during that week, I think we fared
pretty well. Strategic members of the EF can alert IT staff
directly, outside of business hours, using SMS text, to expedite
resolution of Tier II & Tier III service outages. If CBI, or
other parts of the Eclipse Foundation's Infra, are instrumental to
your business, please consider Strategic Membership, as it has
many benefits.
https://www.eclipse.org/membership/
Denis
On 2021-08-13 9:38 a.m., Christoph Läubrich wrote:
Thanks, Ed, for the detailed timeline. I can also confirm that, from
a simple committer's point of view, the outage was not over on Aug 2
(maybe 'core services' mean something different to me than they do
from the infra point of view). It lasted until Aug 4, and only then
could I continue the work on my issues.
So for me the summary "The outage was extensive, and for core
services, lasted for approximately 18 hours. Non-core services
were degraded for an additional 12 hours." does not feel quite
right. As I said before, I can't prove that; it's just that I was
only able to resume my work on Aug 4 (120 hrs later!), and not
until the tycho-ci server was restarted ...
So it seems to me that a check "are all build servers running and
have executors" is missing from the status page.
On 13.08.21 at 15:22, Ed Willink wrote:
Hi
Thank you all for hitting problems quite quickly once you were
engaged. Perhaps this 'bystander's' perspective may help to
understand the need to communicate better.
I first became aware of the problem after receiving notification
a little after 2:42 EDT 1-Aug that a weekly OCL rebuild had
failed. Investigation of the log pointed a finger at the GIT
repo and eclipsestatus.io indicated that a major outage was in
progress with an 'investigating' tweet. Clearly someone was on
the case and so the bystander effect took over and I didn't
raise any reports or emails to distract.
'investigating' status advanced to 'fix-in-progress' after an
hour.
But then nothing for a further 5 hours, at which point we got
'it will take 13 hours'. On twitter someone asked when the 13
hours started; one might have hoped that it would be from the
'fix-in-progress' time. This tweet and an 'ETA?' tweet were
never answered.
17 hours later we got 'most websites' back, which might be true
but with important services down, it was misleading. It took a
further perhaps 4 hours for
https://download.eclipse.org/tools/orbit/downloads/latest-I
to return, and 50 hours before projects-storage.eclipse.org was
back and another couple of hours to get
/shared/common/apache-ant-latest/bin/ant back.
IMHO the outage lasted until at least the restoration of
projects-storage.eclipse.org at Aug
4 8:50 and so one of the issues to be addressed by the
postmortem must be why the status page still reports no
incidents or outage on the whole of the 3rd Aug when, for
committers at least, there was no useable service all day.
I must thank the team again for their hard work with a very
difficult problem, but must also stress that the communication
was very poor. So much so that at 3:07 EDT on 4th Aug I sent a
private email to Ed Merks speculating that:
The total silence from the team is now way beyond
incompetence/discourtesy/embarrassment; there must be another
reason.
Paranoia sets in.
Is some government / hostile agency intervening to prevent
communication?
Are the team voluntarily maintaining silence to contain a
security issue?
Please ensure that whenever possible the status updates are much
more informative.
Regards
Ed Willink
On 09/08/2021 21:45, Denis Roy wrote:
I very much appreciate the sympathy and the support. In the
end, the Infra team can do better than this. We'll lick our
wounds and go back to the drawing board to make sure we don't
repeat the same mistakes twice.
Postmortem is written, pending review with my team.
Denis
_______________________________________________
cross-project-issues-dev mailing list
cross-project-issues-dev@xxxxxxxxxxx
To unsubscribe from this list, visit
https://www.eclipse.org/mailman/listinfo/cross-project-issues-dev
--
Denis Roy
Director, IT Services | Eclipse Foundation
Eclipse Foundation: The Community for Open Innovation and Collaboration
Twitter: @droy_eclipse