Job:
#OCPBUGS-62517 issue, 5 days ago: ClusterOperator olm goes Available=False with reason=CatalogdDeploymentCatalogdControllerManager_Deploying or reason=OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying during updates (Verified)
Issue 17438850: ClusterOperator olm goes Available=False with reason=CatalogdDeploymentCatalogdControllerManager_Deploying or reason=OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying during updates
Description: Description of problem:
 
 [A component must not report Available=False during the course of a normal upgrade.|https://github.com/openshift/api/blob/7f245291a17ac0bd31cf8ba08530c3355b86dbea/config/v1/types_cluster_operator.go#L156]
 
 ClusterOperator olm goes Available=False with reason=CatalogdDeploymentCatalogdControllerManager_Deploying or reason=OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying during updates
 
 Example job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade/1972489796022439936
 {code:none}
    Sep 29 04:35:47.504 E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment
 Sep 29 04:35:47.504 - 52s   E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment
 Sep 29 04:42:35.127 E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment
 Sep 29 04:42:35.127 - 12s   E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment
  {code}
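 For anyone triaging this on a live cluster, the transitions above can be inspected directly. A minimal sketch, assuming a kubeconfig with access to the cluster under test and {{jq}} on the path:
 {code:none}
 # Dump the olm ClusterOperator's Available condition as JSON
 $ oc get clusteroperator olm -o json \
     | jq '.status.conditions[] | select(.type=="Available")'
 {code}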
 Version-Release number of selected component (if applicable):
 
 The issue was spotted in a 4.21-to-4.21 upgrade test.
 {code:none}
     INFO[2025-09-29T02:33:17Z] Using explicitly provided pull-spec for release initial (registry.ci.openshift.org/ocp/release:4.21.0-0.ci-2025-09-28-082535)
     INFO[2025-09-29T02:33:17Z] Using explicitly provided pull-spec for release latest (registry.ci.openshift.org/ocp/release:4.21.0-0.ci-2025-09-29-022535){code}
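 The two payloads can be inspected and diffed with {{oc adm release info}}; a hedged sketch, assuming pull access to registry.ci.openshift.org:
 {code:none}
 # Show version and component images for the initial payload
 $ oc adm release info registry.ci.openshift.org/ocp/release:4.21.0-0.ci-2025-09-28-082535
 
 # List which operators changed between the initial and latest payloads
 $ oc adm release info \
     --changes-from registry.ci.openshift.org/ocp/release:4.21.0-0.ci-2025-09-28-082535 \
     registry.ci.openshift.org/ocp/release:4.21.0-0.ci-2025-09-29-022535
 {code}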
 How reproducible:
 
 Seems to fail consistently in [the aggregated job|https://prow.ci.openshift.org/view/gs/test-platform-results/logs/aggregated-gcp-ovn-upgrade-4.21-micro-release-openshift-release-analysis-aggregator/1972561250676117504], but there is also [a green run|https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30308/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/1971564973029068800] of a similar test.
 {code:none}
 ### failure
 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.21-e2e-gcp-ovn-upgrade/1972489796022439936/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20250929-034333.xml | grep 'clusteroperator/olm should not change condition/Available' -A1
     <testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm should not change condition/Available" time="7014.05639286">
         <failure message="">4 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:&#xA;&#xA;Sep 29 04:35:47.504 E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment&#xA;Sep 29 04:35:47.504 - 52s   E clusteroperator/olm condition/Available reason/CatalogdDeploymentCatalogdControllerManager_Deploying status/False CatalogdDeploymentCatalogdControllerManagerAvailable: Waiting for Deployment&#xA;Sep 29 04:42:35.127 E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment&#xA;Sep 29 04:42:35.127 - 12s   E clusteroperator/olm condition/Available reason/OperatorcontrollerDeploymentOperatorControllerControllerManager_Deploying status/False OperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Waiting for Deployment&#xA;&#xA;2 unwelcome but acceptable clusteroperator state transitions during e2e test run.  These should not happen, but because they are tied to exceptions, the fact that they did happen is not sufficient to cause this test-case to fail:&#xA;&#xA;Sep 29 04:36:39.932 W clusteroperator/olm condition/Available reason/AsExpected status/True CatalogdDeploymentCatalogdControllerManagerAvailable: Deployment is available\nOperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Deployment is available (exception: Available=True is the happy case)&#xA;Sep 29 04:42:48.072 W clusteroperator/olm condition/Available reason/AsExpected status/True CatalogdDeploymentCatalogdControllerManagerAvailable: Deployment is available\nOperatorcontrollerDeploymentOperatorControllerControllerManagerAvailable: Deployment is available (exception: Available=True is the happy case)&#xA;</failure>
 
 ### success
 $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/30308/pull-ci-openshift-origin-main-e2e-gcp-ovn-upgrade/1971564973029068800/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/e2e-monitor-tests__20250926-142805.xml | grep 'clusteroperator/olm should not change condition/Available' -A1
     <testcase name="[Monitor:legacy-cvo-invariants][bz-OLM] clusteroperator/olm should not change condition/Available" time="0"></testcase>
     <testcase name="[Monitor:legacy-cvo-invariants][bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available" time="0"></testcase>{code}
 Steps to Reproduce:
 {code:none}
     1. Run the aggregated job above
     {code}
 Actual results:
 {code:none}
 co/olm goes Available=False during the upgrade test.{code}
 Expected results:
 {code:none}
 co/olm stays Available=True during the upgrade test.{code}
 Additional info:
 {code:none}
 The failures above were taken from a 4.21-to-4.21 upgrade test. Earlier versions could be affected as well.{code}
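 One way to catch these transitions outside of CI is to stream the condition while an upgrade runs. A minimal sketch, assuming {{oc}} and {{jq}} are available (run alongside the upgrade in another terminal):
 {code:none}
 # Stream changes to co/olm's Available condition during an upgrade
 $ oc get clusteroperator olm --watch -o json \
     | jq --unbuffered -c '.status.conditions[]
         | select(.type=="Available")
         | {status, reason, lastTransitionTime}'
 {code}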
Status: Verified
#OCPBUGS-23746 issue, 3 weeks ago: openshift-apiserver ClusterOperator should not blip Available=False on brief missing HTTP content-type (New)
Issue 15637203: openshift-apiserver ClusterOperator should not blip Available=False on brief missing HTTP content-type
Description: h2. Description of problem:
 
 Seen [in 4.15 update CI|https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1727427846533550080]:
 {code:none}
 : [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available expand_less
 Run #0: Failed expand_less	1h28m25s
 {  1 unexpected clusteroperator state transitions during e2e test run 
 
 Nov 22 21:47:32.876 - 1s    E clusteroperator/openshift-apiserver condition/Available reason/APIServices_Error status/False APIServicesAvailable: rpc error: code = Unknown desc = malformed header: missing HTTP content-type}
 {code}
 While the Kube API server, if that's what's missing the header, is supposed to always be available, an issue that persists for only 1s is not long enough to warrant [immediate admin intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153]. Teaching the openshift-apiserver operator to stay {{Available=True}} through this kind of brief hiccup, while still going {{Available=False}} for issues where [at least part of the component is non-functional, and the condition requires immediate administrator intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153], would make it easier for admins and SREs operating clusters to identify when intervention is required.
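 The requested behavior is essentially a debounce: only surface {{Available=False}} once the error has persisted past a grace period. A minimal observer-side sketch of that idea in shell, with an illustrative 30-second threshold (the threshold value is an assumption, not from this report; the real fix belongs in the operator's status logic):
 {code:none}
 # Flag Available=False only once it has persisted past $GRACE seconds.
 GRACE=30
 first=""
 while sleep 5; do
   status=$(oc get clusteroperator openshift-apiserver \
     -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')
   if [ "$status" = "False" ]; then
     first=${first:-$(date +%s)}
     if [ $(( $(date +%s) - first )) -ge "$GRACE" ]; then
       echo "Available=False for >${GRACE}s: worth summoning an admin"
     fi
   else
     first=""   # brief blip resolved on its own; reset the timer
   fi
 done
 {code}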
 h2. Version-Release number of selected component (if applicable):
 {code:none}
 $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-ppc64le (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-s390x (all) - 4 runs, 25% failed, 200% of failures match = 50% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 200% of failures match = 40% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 56% failed, 21% of failures match = 12% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 13% of failures match = 4% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 6% of failures match = 3% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 63% of failures match = 10% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 11% of failures match = 2% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-ovn-etcd-scaling (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-ibmcloud-csi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
 periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 14% of failures match = 7% impact
 {code}
 
 The impact rates are low enough that I haven't checked older 4.y.  And it's possible that some of those matches have the operator going {{Available=False}} for other reasons besides {{APIServices_Error}}:
 
 {code:none}
 $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/openshift-apiserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
       2 openshift-apiserver APIServerDeployment_NoPod
       2 openshift-apiserver APIServerDeployment_PreconditionNotFulfilled
      19 openshift-apiserver APIServices_Error
      22 openshift-apiserver APIServerDeployment_NoDeployment
 {code}
 
 h2. How reproducible:
 
 {{12% impact}} for {{periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade}} looks like the highest impact among the jobs with double-digit run counts.
 
 h2. Steps to Reproduce:
 
 Run {{periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade}} a bunch of times watching the {{openshift-apiserver}} ClusterOperator's {{Available}} condition.
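 For a live run, the condition can be watched directly; a minimal sketch, assuming a kubeconfig for the cluster under test:
 {code:none}
 # Tabulate the openshift-apiserver ClusterOperator's conditions
 $ oc get clusteroperator openshift-apiserver \
     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
 {code}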
 
 h2. Actual results:
 
 Some very brief blips of {{Available=False}} that self-resolve before an admin could possibly respond to the summons.
 
 h2. Expected results:
 
 No quickly-resolving blips in CI.  No long runs of {{Available=False}} for issues that don't seem worth summoning an admin.  Still going {{Available=False}} for outages that need immediate admin response.
Status: New

Found in 0.00% of runs (NaN% of failures) across 2 total runs and 1 job (0.00% failed).