Hello all.
I am working on a fix for [1] that is supposed to properly fix [2] (the problem affects Che as well).
I have a draft PR [3] that fixes the issue (an OOM problem on large clusters), so we can revert the 5 GiB memory limit workaround for Che Operator.
In short, to solve the issue we create a custom cache function for the Operator's k8s client that defines which k8s objects should be cached. However, the controller-runtime framework the Operator is built on has some technical limitations, so the cost of the fix is that we have to require a label on every resource our Operator ever uses (even for read-only access).
Our plan is to require the `app.kubernetes.io/instance=che` label on all the k8s objects Che Operator deals with. This doesn't change anything for the resources the Operator creates automatically, but it affects all user-defined objects (for example, a config map with additional CA certs must also have the label, and adding it will be the user's responsibility). Of course, we'll update all the docs, and the PR [3] has a migration mechanism (that runs once on Che Operator start), so existing installations will continue working as expected. But if a user forgets to add the label to a new config map, Che Operator will not see it at all (until the user adds the label, or the Operator restarts and the migration happens).
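To illustrate why an unlabeled object becomes invisible, here is a minimal Go sketch of label-based filtering. It uses only the standard library and is not the code from the PR [3] or the controller-runtime API; `Object`, `matchesSelector`, `visibleNames` and the config map names are hypothetical and exist only for this example.

```go
package main

import "fmt"

// Object is a hypothetical, simplified stand-in for a k8s object:
// only the name and the labels matter for this illustration.
type Object struct {
	Name   string
	Labels map[string]string
}

// matchesSelector reports whether obj carries every key/value pair
// listed in selector (equality-based matching only).
func matchesSelector(obj Object, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := obj.Labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// visibleNames returns the names of the objects that a cache restricted
// to selector would hold. Everything else is never stored, so a client
// reading from such a cache behaves as if those objects did not exist.
func visibleNames(objs []Object, selector map[string]string) []string {
	var names []string
	for _, obj := range objs {
		if matchesSelector(obj, selector) {
			names = append(names, obj.Name)
		}
	}
	return names
}

func main() {
	// The label proposed in this email.
	selector := map[string]string{"app.kubernetes.io/instance": "che"}

	// Assumed example data: two user-defined config maps with CA certs.
	objs := []Object{
		{Name: "ca-certs-labeled", Labels: map[string]string{"app.kubernetes.io/instance": "che"}},
		{Name: "ca-certs-unlabeled", Labels: map[string]string{}},
	}

	// Only the labeled config map is listed.
	fmt.Println(visibleNames(objs, selector))
}
```

In this sketch the second config map is dropped before the reader ever sees it, which mirrors the behavior described above: the object exists in the cluster, but the Operator does not see it until the label is added.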
If anyone has remarks, concerns or ideas, we'll be glad to hear and address them.
[1] https://github.com/eclipse/che/issues/20647
[2] https://issues.redhat.com/browse/CRW-2383
[3] https://github.com/eclipse-che/che-operator/pull/1166

-- 
Mykola Morhun
Software engineer
Red Hat