Re: [jakartaee-tck-dev] [glassfish-dev] Tracking usage data for EE4J working group CI cloud systems
On 9/30/20 8:04 PM, Ed Bratt wrote:
The stability issues mentioned below -- my paraphrase: GlassFish won't
start properly if multiple tests are being run in parallel -- are
concerning. We really need to get to the bottom of this. The whole point
of moving these into K8s containers was to provide isolation and stability.
We are finally seeing usage data, which is a great step forward in
understanding! :-)
We cannot really point fingers at a cause yet, though we have some
suspects in mind that are within our control. We haven't seen any
container/system-level failures that I recall. The closest to a system
level failure was when we were seeing `git` failures due to our JNLP
memory size being too small (increasing the JNLP memory size solved
them). I think it was around April 2020 when we fixed the `git` OOM
failures.
One stability change that is within our control is to set a trap
handler or `try {} finally { cleanup() }` handler to ensure that all
started test processes are terminated during each test run. We might
also consider waiting during cleanup for listening ports stuck in the
TIME_WAIT state to actually close (simply skipping the wait if there
are none).
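A minimal sketch of what such a trap handler could look like in shell.
This is not our actual test script; the process layout, the port number
(8080), and the 30-second timeout are all assumptions for illustration:

```shell
#!/bin/sh
# Hypothetical cleanup handler: terminate every test process this script
# started, then wait briefly for lingering listening ports to close.
cleanup() {
    # Kill all direct child processes of this script (assumed layout).
    pkill -TERM -P $$ 2>/dev/null
    # Wait up to ~30s for an (assumed) server port to stop showing up,
    # e.g. lingering in TIME_WAIT; skip immediately if it is not there.
    for i in $(seq 1 30); do
        netstat -an 2>/dev/null | grep -q ':8080 ' || break
        sleep 1
    done
}
# Run cleanup() whether the test run exits normally or is interrupted.
trap cleanup EXIT INT TERM

# ... start server, run tests ...
```

The key point is the `trap ... EXIT INT TERM` line: cleanup runs even
when a test run is aborted, which is the case that currently leaves
processes and ports behind.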
One symptom that we have seen in our CI environment when running two
concurrent Platform TCK test runs is reported on
https://github.com/eclipse-ee4j/glassfish/issues/23191, which we do
need to solve at some point. I can recreate a related failure locally
that could be the same issue (hard to know for sure). We mostly work
around it by rerunning the TCK tests when we see it and by avoiding
starting multiple concurrent Platform TCK test runs.
Another approach to handling glassfish/issues/23191 in CI is to assume
that problems like that can happen and handle them with a loop that
terminates GlassFish, sleeps for an appropriate amount of time, and
retries the start a few times.
We will need to consider possible expansion of resources as well -- so
independence and reliability need to be addressed. As we move forward --
it seems entirely plausible we might be doing work in multiple feature
branches -- e.g. maybe Jakarta EE 9.1 w/JDK11+ support only and Jakarta
EE 10 with 9.1 changes AND new features for Jakarta EE 10. We may be
expanding the test matrix requirements as well -- I do not know but we
should try to consider this as we start investigating optimizations
and/or resource allocation changes.
Based on your numbers, is it possible that the upper limit is on Memory,
not CPU? (I make this comment in relation to your observation that the
Jakarta EE TCK should have up to 100 vCPUs but never gets more than 76
-- though I also don't understand the fractional Max value; that's for
a later date.)
Yes, I agree and reached the same conclusion: we first hit an upper
limit on memory, not CPUs. IMO, we should be able to reduce the memory
used per container/VM, in which case it may be useful to use one CPU per
container/VM to get more (testing) bang for our buck out of the system.
One comment I saw today mentioned that a high max JVM memory setting is
used to work around memory issues with the EJB tests. IMO, we should
create more separate test groups for the EJB tests to see if that helps
reduce the need for 10GB per test container/VM.
Please be careful that we don't get too distracted about trying to
optimize any of this right now. We can spend some time with that once
the TCKs are all finalized -- put another way, if you had spare cycles,
I'd rather you put time into moving the Jakarta EE TCKs to their final
status, before running experiments to find out how these change Resource
Pack usage rates. I'll also note that weekends are a frequent work time
for our committer members, who contribute as a sideline to their
regular jobs, so we should be a bit careful with that as well.
I agree; I did get excited and wanted to push a little more on coming up
with a way to get further answers about our CI environment, plus do
some testing this year if that helps us make a decision for next year.
So, I think we have a path forward that we could follow, time
permitting.
I am totally not surprised that this is a rather burst-like usage
pattern.
The usage numbers are of limited use in that they don't show exactly
how the resources are used (building Platform TCK, running Platform
TCK, building Standalone TCKs, running Standalone TCKs). We do know it
takes longer to run the Platform TCK.
Perhaps some day we will be able to correlate each running TCK job with
the usage report to allow more detailed usage reporting per type of TCK job.
Of course, that makes planning more tricky because you'd
ideally like the utilization to be consistent and robust.
Agreed.
The parallel
operation model for the TCKs -- both running multiple stand-alone TCKs
and running the Jakarta EE Platform TCK -- is designed this way: the
runs are optimized to deliver complete test results as quickly as
possible, not to sequence the tests one after another. Having delivered
this for a long time, I can definitely say I prefer the results sooner
rather than waiting days and days.
Agreed, we just need to run more correctly/defensively and I think we
will get there.
Scott
-- Ed
On 9/30/2020 11:11 AM, Scott Marlow wrote:
Here are the average + max Memory/#CpuCores:
avg memory limit   Max Memory   avg CPU limit   Max CPU
================   ==========   =============   =======
61.58 Gi           378.00 Gi    12.1 vCPU       74.7 vCPU
There are some CPU/memory limits in the Jenkinsfile
(https://urldefense.com/v3/__https://github.com/eclipse-ee4j/jakartaee-tck/blob/master/Jenkinsfile*L147__;Iw!!GqivPVa7Brio!K6SiwGSX9lBaEKBbtvCH6386RJfFh1TVdZrGAH_A4H2aAbNuuSrBjJubOh13CnE$
). Each memory limit specifies the container/VM memory size
(since we didn't specify the initial memory request setting), so the
calculation is something like:
memory usage = 10Gi per VM * number of test groups
CPU cores = 2 * number of test groups
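As a back-of-the-envelope check of that formula, here is a tiny sketch.
The group count of 37 is an assumption picked because it roughly
matches the observed maxima above (~378 Gi, ~74.7 vCPU); it is not a
number taken from the Jenkinsfile:

```shell
#!/bin/sh
# Sketch of the per-run resource math. NUM_GROUPS=37 is an assumed
# concurrent test-group count, not a value from the Jenkinsfile.
NUM_GROUPS=37
MEM_PER_VM_GI=10   # memory limit per test container/VM
CPU_PER_VM=2       # CPU cores per test container/VM

echo "memory usage: $((NUM_GROUPS * MEM_PER_VM_GI)) Gi"
echo "CPU cores:    $((NUM_GROUPS * CPU_PER_VM))"
```

With 37 groups this gives 370 Gi and 74 cores, which lines up with the
observed max of 378 Gi / 74.7 vCPU and suggests roughly that many test
groups run concurrently at peak.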
The data-capture does give us a high level view of what the container
level memory/CPU core usage has been. Quoting from a previous TCK ml
conversation (from David Blevins with subject: "Resource Pack
Allocations & Maximizing Use"):
"
Over all of EE4J we have 105 resource packs paid for that give us a
total of 210 cpu cores and 840 GB RAM. These resource packs are
dedicated, not elastic. The actual allocation of 105 resource packs
is by project. The biggest allocation is 50 resource packs to
ee4j.jakartaee-tck (this project), the second biggest is 15 resource
packs to ee4j.glassfish.
The most critical takeaway from the above is we have 50 resource
packs dedicated to this project giving us a total of 100 cores and
400GB ram at our disposal 24x7. These 50 are bought and paid for --
we do not save money if we don't use them.
"
So, the Platform TCK is budgeted to use 100 cores and 400GB of RAM;
however, we haven't used more than 75 CPU cores and 378GB of memory
(per the max memory/CPU numbers pasted above).
I think the fundamental question is: can we manage this resource,
hence the cost, based on these data?
IMO, there is memory/CPU tuning we could do, if there is time to
experiment before answers are needed about current usage versus what
usage could be.
Alwin helped me to create a Platform TCK runner job that can run
against my github repository. Thanks Alwin!
I created
https://urldefense.com/v3/__https://github.com/scottmarlow/jakartaee-tck/tree/tuning__;!!GqivPVa7Brio!K6SiwGSX9lBaEKBbtvCH6386RJfFh1TVdZrGAH_A4H2aAbNuuSrBjJubC77yR40$
to represent changes to improve our memory/cpu tuning.
When we have time to try memory/cpu tuning improvements, we can run
tests with
https://urldefense.com/v3/__https://ci.eclipse.org/jakartaee-tck/job/jakartaee-tck-scottmarlow__;!!GqivPVa7Brio!K6SiwGSX9lBaEKBbtvCH6386RJfFh1TVdZrGAH_A4H2aAbNuuSrBjJubdTey-6o$
against the `tuning` branch. Pull requests are welcome! :-)
So, I think this identifies the `how we can try making improvements to
our usage`. I'm also hoping that reducing our memory/cpu usage can
translate into being able to run more concurrent tests at the same time.
Currently, we also have to avoid starting multiple Platform TCK test
runs at the same time or we hit test stability problems (GlassFish
won't start correctly for some tests).
You are also welcome to review any of the commentary and ask
questions directly via the issue.
I asked on
https://urldefense.com/v3/__https://bugs.eclipse.org/bugs/show_bug.cgi?id=565098__;!!GqivPVa7Brio!K6SiwGSX9lBaEKBbtvCH6386RJfFh1TVdZrGAH_A4H2aAbNuuSrBjJubSQ_qLy8$
about measuring usage for a weekend or over a few days.
The answer is that the measuring is always on and can be observed as
per links mentioned in the bugzilla issue. This will require some
dancing as we need to ensure that no other tests are run the same day
(until after we have noted the usage for the `tuning` test run). This
is important so that we have a way to compare use of different settings.
I'm not sure when we will have time to do this testing yet, but it
would be nice to fit it in.
Scott