Job:
#OCPBUGS-54927issue3 days agoCI: API is broken in periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-single-node-techpreview-serial CLOSED
{code:java}
: [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available {code}
My understanding is that this fails if the "openshift-apiserver" moves out of Available to some other state. If this happens often then it may justify why other things are reporting "connection refused". The last time this test passed was on Feb 11th and it also started to fail later on the same day never to recover, I went run by run and I could not see a single success run after that.
#OCPBUGS-23746issue3 weeks agoopenshift-apiserver ClusterOperator should not blip Available=False on brief missing HTTP content-type New
Issue 15637203: openshift-apiserver ClusterOperator should not blip Available=False on brief missing HTTP content-type
Description: h2. Description of problem:
 
 Seen [in 4.15 update CI|https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1727427846533550080]:
 {code:none}
 : [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available expand_less
 Run #0: Failed expand_less	1h28m25s
 {  1 unexpected clusteroperator state transitions during e2e test run 
 
 Nov 22 21:47:32.876 - 1s    E clusteroperator/openshift-apiserver condition/Available reason/APIServices_Error status/False APIServicesAvailable: rpc error: code = Unknown desc = malformed header: missing HTTP content-type}
 {code}
 While the Kube API server, if that's what's missing the header, is supposed to always be available, an issue that only persists for 1s is not long enough to warrant [immediate admin intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153]. Teaching the openshift-apiserver operator to stay {{Available=True}} for this kind of brief hiccup, while still going {{Available=False}} for issues where [least part of the component is non-functional, and that the condition requires immediate administrator intervention|https://github.com/openshift/api/blob/c3f7566f6ef636bb7cf9549bf47112844285989e/config/v1/types_cluster_operator.go#L149-L153] would make it easier for admins and SREs operating clusters to identify when intervention was required.
 h2. Version-Release number of selected component (if applicable):
 {code:none}
 $ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/openshift-apiserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-ppc64le (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-ibmcloud-ovn-multi-s390x (all) - 4 runs, 25% failed, 200% of failures match = 50% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 200% of failures match = 40% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 56% failed, 21% of failures match = 12% impact
 periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 13% of failures match = 4% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 6% of failures match = 3% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 63% of failures match = 10% impact
 periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 11% of failures match = 2% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-ovn-etcd-scaling (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-ibmcloud-csi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-e2e-vsphere-ovn-techpreview (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
 periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
 periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
 periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 14% of failures match = 7% impact
 {code}
 
 The impact rates are low enough that I haven't checked older 4.y.  And it's possible that some of those matches have the operator going {{Available=False}} for other reasons besides {{APIServices_Error}}:
 
 {code:none}
 $ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/openshift-apiserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*|\1 \2|' | sort | uniq -c | sort -n
       2 openshift-apiserver APIServerDeployment_NoPod
       2 openshift-apiserver APIServerDeployment_PreconditionNotFulfilled
      19 openshift-apiserver APIServices_Error
      22 openshift-apiserver APIServerDeployment_NoDeployment
 {code}
 
 h2. How reproducible:
 
 {{12% impact}} for {{periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade}} looks like the highest impact among the jobs with double-digit run counts.
 
 h2. Steps to Reproduce:
 
 Run {{periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade}} a bunch of times watching the {{openshift-apiserver}} ClusterOperator's {{Available}} condition.
 
 h2. Actual results:
 
 Some very brief blips of {{Available=False}} that self-resolve before an admin could possibly resolve to the summons.
 
 h2. Expected results:
 
 No quickly-resolving blips in CI.  No long runs of {{Available=False}} for issues that don't seem worth summoning an admin.  Still going {{Available=False}} for outages that need immediate admin response.
Status: New
periodic-ci-openshift-multiarch-master-nightly-4.16-upgrade-from-stable-4.15-ocp-e2e-upgrade-gcp-ovn-heterogeneous (all) - 24 runs, 17% failed, 225% of failures match = 38% impact
#1932815290828066816junit2 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1932925413055533056junit40 hours ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1932486241337479168junit2 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1932339526059954176junit3 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1931818974689890304junit4 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1930783992169107456junit7 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1929260079681376256junit11 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1929078142920560640junit12 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.
#1928539503367032832junit13 days ago
# [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
0 unexpected clusteroperator state transitions during e2e test run, as desired.

Found in 37.50% of runs (225.00% of failures) across 24 total runs and 1 jobs (16.67% failed) in 259ms - clear search | chart view - source code located on github