Re: [jakartaee-tck-dev] [glassfish-dev] Tracking usage data for EE4J working group CI cloud systems
The stability issues mentioned below -- my paraphrase: GlassFish won't
start properly if multiple tests are run in parallel -- are concerning.
We really need to get to the bottom of this; the whole point of moving
these into K8s containers was to provide isolation and stability. We
will need to consider possible expansion of resources as well, so
independence and reliability need to be addressed. As we move forward,
it seems entirely plausible we might be doing work in multiple feature
branches, e.g. Jakarta EE 9.1 with JDK 11+ support only, and Jakarta
EE 10 with the 9.1 changes plus new features for Jakarta EE 10. We may
be expanding the test matrix requirements as well; I do not know, but
we should try to consider this as we start investigating optimizations
and/or resource allocation changes.
Based on your numbers, is it possible that the upper limit is on
memory, not CPU? (I make this comment in relation to your observation
that the Jakarta EE TCK should have up to 100 vCPUs available but never
gets more than 76. I also don't understand the fractional Max value,
but that's a question for a later date.)
Please be careful that we don't get too distracted trying to optimize
any of this right now. We can spend some time on that once the TCKs are
all finalized. Put another way: if you had spare cycles, I'd rather you
put the time into moving the Jakarta EE TCKs to their final status
before running experiments to find out how these changes affect
Resource Pack usage rates. I'll also note that weekends are a frequent
working time for our committer members who contribute as a side-light
to their regular job, so we should be a bit careful with that as well.
I am not at all surprised that this is a rather bursty usage pattern.
Of course, that makes planning trickier, because you would ideally like
the utilization to be consistent and predictable. The parallel
operation model for the TCKs -- both running multiple stand-alone TCKs
and running the Jakarta EE Platform TCK -- is designed to work this
way: it is optimized to get completed test results as quickly as
possible, not to sequence the tests one after another. Having delivered
this for a long time, I can definitely say I prefer the results sooner,
rather than waiting days and days.
-- Ed
On 9/30/2020 11:11 AM, Scott Marlow wrote:
Here are the average + max Memory/#CpuCores:
  avg memory limit   Max Memory   avg CPU limit   Max CPU
  ================   ==========   =============   =======
  61.58 Gi           378.00 Gi    12.1 vCPU       74.7 vCPU
There are some cpu/memory limits in the Jenkinsfile
(https://github.com/eclipse-ee4j/jakartaee-tck/blob/master/Jenkinsfile#L147);
each memory limit specifies the container/VM memory size (since we
didn't specify an initial memory request setting), so the calculation
is something like:

  memory usage = 10Gi per VM * number of test groups
  CPU cores    = 2 * number of test groups

(That roughly lines up with the maxima above: about 38 concurrent test
groups would give 380 Gi and 76 vCPUs.)
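For reference, here is a minimal sketch of the kind of Kubernetes
container resources stanza this describes. The values mirror the
10Gi / 2 CPUs per test group above; the container name is illustrative,
and this is not the literal Jenkinsfile content:

  # One test-group container in a pod template. When only "limits" is
  # set and no "requests" section is given, Kubernetes defaults the
  # request to the limit, so each test group reserves the full
  # 10Gi / 2 CPUs for its entire run.
  containers:
  - name: tck-test-group        # illustrative name
    resources:
      limits:
        memory: "10Gi"
        cpu: "2"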
The data capture does give us a high-level view of what the
container-level memory/CPU core usage has been. Quoting from a previous
TCK mailing-list conversation (from David Blevins, subject: "Resource
Pack Allocations & Maximizing Use"):
"
Over all of EE4J we have 105 resource packs paid for that give us a
total of 210 cpu cores and 840 GB RAM. These resource packs are
dedicated, not elastic. The actual allocation of 105 resource packs
is by project. The biggest allocation is 50 resource packs to
ee4j.jakartaee-tck (this project), the second biggest is 15 resource
packs to ee4j.glassfish.
The most critical takeaway from the above is we have 50 resource
packs dedicated to this project giving us a total of 100 cores and
400GB ram at our disposal 24x7. These 50 are bought and paid for --
we do not save money if we don't use them.
"
So, the Platform TCK is budgeted to use 100 cores and 400 GB of RAM;
however, we haven't used more than 75 CPU cores and 378 Gi of memory
(per the max memory/CPU numbers pasted above).
I think the fundamental question is: can we manage this resource, and
hence the cost, based on these data?
IMO, there is memory/cpu tuning we could do, given time to experiment,
before answers are needed about what our usage is now versus what it
could be.
Alwin helped me to create a Platform TCK runner job that can run
against my github repository. Thanks Alwin!
I created
https://github.com/scottmarlow/jakartaee-tck/tree/tuning
to represent changes to improve our memory/cpu tuning.
When we have time to try memory/cpu tuning improvements, we can run
tests with
https://ci.eclipse.org/jakartaee-tck/job/jakartaee-tck-scottmarlow
against the `tuning` branch. Pull requests are welcome! :-)
So, I think this identifies how we can try making improvements to our
usage. I'm also hoping that reducing our memory/cpu usage can translate
into being able to run more tests concurrently; the sketch below shows
one direction such tuning could take.
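As a hypothetical illustration only (the request values are
placeholders, not measured recommendations), tuning could mean setting
explicit requests below the limits, so the scheduler packs test-group
pods by the smaller request while each group can still burst up to its
limit:

  # Hypothetical tuned stanza: requests lower than limits. The
  # scheduler places pods based on requests, so smaller requests can
  # fit more test groups onto the same nodes, while each container
  # may still use resources up to its limit.
  resources:
    requests:
      memory: "6Gi"   # placeholder value
      cpu: "1"        # placeholder value
    limits:
      memory: "10Gi"
      cpu: "2"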
Currently, we also have to avoid starting multiple Platform TCK test
runs at the same time or we hit test stability problems (GlassFish
won't start correctly for some tests).
You are also welcome to review any of the commentary and ask questions
directly via the issue (linked below).
I asked on
https://bugs.eclipse.org/bugs/show_bug.cgi?id=565098
about measuring usage for a weekend or over a few days.
The answer is that measuring is always on and can be observed via the
links mentioned in the Bugzilla issue. This will require some
coordination, as we need to ensure that no other tests run the same day
(until after we have noted the usage for the `tuning` test run); that
is important so that we have a way to compare the different settings.
I'm not sure yet when we will have time to do this testing, but it
would be nice to fit it in.
Scott