Job:
#OCPBUGS-23746issue3 weeks agoopenshift-apiserver ClusterOperator should not blip Available=False on brief missing HTTP content-type New
Issue 15637203: openshift-apiserver ClusterOperator should not blip Available=False on brief missing HTTP content-type
Description: h2. Description of problem:
 
 Seen [in 4.15 update CI|https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1727427846533550080]:
 {code:none}
 : [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available expand_less
 Run #0: Failed expand_less	1h28m25s
 {  1 unexpected clusteroperator state transitions during e2e test run 
 
 Nov 22 21:47:32.876 - 1s    E clusteroperator/openshift-apiserver condition/Available reason/APIServices_Error status/False APIServicesAvailable: rpc error: code = Unknown desc = malformed header: missing HTTP content-type}
 {code}
 While the Kube API server, if that's what's missing the header, is supposed to always be available, an issue that only persists for 1s is not long enough to warrant [immediate admin intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153]. Teaching the openshift-apiserver operator to stay {{Available=True}} for this kind of brief hiccup, while still going {{Available=False}} for issues where [least part of the component is non-functional, and that the condition requires immediate administrator intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153] would make it easier for admins and SREs operating clusters to identify when intervention was required.
 h2. Version-Release number of selected component (if applicable):
 {code:none}
 $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-ppc64le (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-s390x (all) - 4 runs, 25% failed, 200% of failures match = 50% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 200% of failures match = 40% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 56% failed, 21% of failures match = 12% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 13% of failures match = 4% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 6% of failures match = 3% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 63% of failures match = 10% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 11% of failures match = 2% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-ovn-etcd-scaling (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-ibmcloud-csi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
 periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 14% of failures match = 7% impact
 {code}
 
 The impact rates are low enough that I haven't checked older 4.y.  And it's possible that some of those matches have the operator going {{Available=False}} for other reasons besides {{APIServices_Error}}:
 
 {code:none}
 $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/openshift-apiserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
       2 openshift-apiserver APIServerDeployment_NoPod
       2 openshift-apiserver APIServerDeployment_PreconditionNotFulfilled
      19 openshift-apiserver APIServices_Error
      22 openshift-apiserver APIServerDeployment_NoDeployment
 {code}
 
 h2. How reproducible:
 
 {{12% impact}} for {{periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade}} looks like the highest impact among the jobs with double-digit run counts.
 
 h2. Steps to Reproduce:
 
 Run {{periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade}} a bunch of times watching the {{openshift-apiserver}} ClusterOperator's {{Available}} condition.
 
 h2. Actual results:
 
 Some very brief blips of {{Available=False}} that self-resolve before an admin could possibly resolve to the summons.
 
 h2. Expected results:
 
 No quickly-resolving blips in CI.  No long runs of {{Available=False}} for issues that don't seem worth summoning an admin.  Still going {{Available=False}} for outages that need immediate admin response.
Status: New
#OCPBUGS-54927issue3 days agoCI: API is broken in periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-techpreview-serial CLOSED
{code:java}
: [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available {code}
My understanding is that this fails if the "openshift-apiserver" moves out of Available to some other state. If this happens often then it may justify why other things are reporting "connection refused". The last time this test passed was on Feb 11th and it also started to fail later on the same day never to recover, I went run by run and I could not see a single success run after that.

Found in 0.00% of runs (0.00% of failures) across 19 total runs and 1 jobs (15.79% failed) in 4.663s - clear search | chart view - source code located on github