Job:
#OCPBUGS-36738 issue 3 days ago Frequent router-default pod restarts CLOSED
Issue 16127694: Frequent router-default pod restarts
Description: Description of problem:
 {code:none}
   SREP began receiving an increasing number of console probe errors and noticed frequent restarts of the router-default pods.{code}
 *What triage steps have been taken so far?:*
 
 Console probes are failing, the `router-default` pods are experiencing timeouts, and the `sdn-controller` pod is logging RBAC warnings.
 
 Issues are persistent.  
 
 *What logs have been reviewed (attach them?):*
 
  
 
 blackbox-exporter probes for the console are failing.
 {code:java}
 ts=2024-07-09T09:40:31.413312762Z caller=main.go:189 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Beginning probe" probe=http timeout_seconds=5
 ts=2024-07-09T09:40:31.413415127Z caller=http.go:328 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Resolving target address" target=console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com ip_protocol=ip6
 ts=2024-07-09T09:40:31.41467833Z caller=http.go:328 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Resolved target address" target=console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com ip=54.237.159.91
 ts=2024-07-09T09:40:31.414763196Z caller=client.go:252 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Making HTTP request" url=https://54.237.159.91/health host=console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com
 ts=2024-07-09T09:40:36.417621502Z caller=handler.go:119 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=error msg="Error for HTTP request" err="Get \"https://54.237.159.91/health\": context deadline exceeded"
 ts=2024-07-09T09:40:36.417670499Z caller=handler.go:119 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=info msg="Response timings for roundtrip" roundtrip=0 start=2024-07-09T09:40:31.41484715Z dnsDone=2024-07-09T09:40:31.41484715Z connectDone=2024-07-09T09:40:31.41583556Z gotConn=0001-01-01T00:00:00Z responseStart=0001-01-01T00:00:00Z tlsStart=2024-07-09T09:40:31.41585747Z tlsDone=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z
 ts=2024-07-09T09:40:36.417692223Z caller=main.go:189 module=http_2xx target=https://console-openshift-console.apps.app-sre-prod-04.i5h0.p1.openshiftapps.com/health level=error msg="Probe failed" duration_seconds=5.004348192 {code}
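 One way to read the "Response timings for roundtrip" line above is to separate the phases that have a real timestamp from those still at the zero value. The sketch below assumes that a timestamp of 0001-01-01T00:00:00Z means the phase was never reached; `split_phases` is a hypothetical helper, not part of blackbox-exporter.

```python
# Assumption: a zero-valued timestamp marks a phase that never completed.
ZERO_TIME = "0001-01-01T00:00:00Z"

# Timing fields copied from the "Response timings for roundtrip" log line.
timings_line = (
    "start=2024-07-09T09:40:31.41484715Z dnsDone=2024-07-09T09:40:31.41484715Z "
    "connectDone=2024-07-09T09:40:31.41583556Z gotConn=0001-01-01T00:00:00Z "
    "responseStart=0001-01-01T00:00:00Z tlsStart=2024-07-09T09:40:31.41585747Z "
    "tlsDone=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z"
)


def split_phases(line):
    """Return (reached, missing) phase names from a key=value timing line."""
    fields = dict(pair.split("=", 1) for pair in line.split())
    reached = [name for name, value in fields.items() if value != ZERO_TIME]
    missing = [name for name, value in fields.items() if value == ZERO_TIME]
    return reached, missing


reached, missing = split_phases(timings_line)
print("reached:", reached)
print("never reached:", missing)
```

 Under that assumption, the probe connected and started the TLS handshake (`connectDone` and `tlsStart` are set), but `tlsDone` was never recorded before the 5-second timeout.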
  
 
  
 {code:java}
 $ oc describe pod router-default-7bf67dcb5c-h7x2z -n openshift-ingress
 Last State: Terminated
   Reason: Error
   Message: http failed: read tcp 127.0.0.1:40526->127.0.0.1:80: i/o timeout
 I0705 09:50:58.430264 1 healthz.go:261] backend-proxy-http check failed: healthz [-]backend-proxy-http failed: read tcp 127.0.0.1:53010->127.0.0.1:80: i/o timeout 
 I0705 09:50:58.430273 1 healthz.go:261] backend-proxy-http check failed: healthz [-]backend-proxy-http failed: read tcp 127.0.0.1:53002->127.0.0.1:80: i/o timeout{code}
  
 
  
 {code:java}
 $ oc logs sdn-controller-hlq4w -n openshift-sdn
 I0705 10:51:49.147384 1 master.go:56] Initializing SDN master 
 W0705 10:51:49.161823 1 master.go:156] Failed to list pods: pods is forbidden: User "system:serviceaccount:openshift-sdn:sdn-controller" cannot list resource "pods" in API group "" at the cluster scope 
 W0705 10:51:49.163066 1 master.go:161] Failed to list services: services is forbidden: User "system:serviceaccount:openshift-sdn:sdn-controller" cannot list resource "services" in API group "" at the cluster scope {code}
 The cluster completed an upgrade to 4.15.19 shortly before we started seeing the issue.
 
 *Additional info:*
 
 Ongoing thread with networking team - [https://redhat-internal.slack.com/archives/CDCP2LA9L/p1720179270455589]
Status: CLOSED
periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade (all) - 360 runs, 49% failed, 1% of failures match = 1% impact
#1816386786725728256 junit 44 hours ago
cause/Error code/2 reason/ContainerExit -status" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found]
E0725 09:01:35.620432       1 main.go:93] Failed to list v1 volumesnapshots with error=volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" at the cluster scope: RBAC: [clusterrole.rbac.authorization.k8s.io "system:scope-impersonation" not found, clusterrole.rbac.authorization.k8s.io "system:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "console-extensions-reader" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:discovery" not found, clusterrole.rbac.authorization.k8s.io "cluster-status" not found, clusterrole.rbac.authorization.k8s.io "system:basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:service-account-issuer-discovery" not found, clusterrole.rbac.authorization.k8s.io "basic-user" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:scc:restricted-v2" not found, clusterrole.rbac.authorization.k8s.io "openshift-csi-snapshot-controller-runner" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-docker" not found, clusterrole.rbac.authorization.k8s.io "system:webhook" not found, clusterrole.rbac.authorization.k8s.io "system:discovery" not found, clusterrole.rbac.authorization.k8s.io "system:openshift:public-info-viewer" not found, clusterrole.rbac.authorization.k8s.io "self-access-reviewer" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-source" not found, clusterrole.rbac.authorization.k8s.io "system:build-strategy-jenkinspipeline" not found, clusterrole.rbac.authorization.k8s.io "system:oauth-token-deleter" not found, clusterrole.rbac.authorization.k8s.io "helm-chartrepos-viewer" not found]
I0725 09:01:40.899404       1 leaderelection.go:250] attempting to acquire leader lease openshift-cluster-storage-operator/snapshot-controller-leader...
#1815751259123093504 junit 3 days ago
I0723 14:55:08.657079       1 main.go:206] Start NewCSISnapshotController with kubeconfig [] resyncPeriod [15m0s]
E0723 14:55:27.554762       1 main.go:93] Failed to list v1 volumesnapshots with error=volumesnapshots.snapshot.storage.k8s.io is forbidden: User "system:serviceaccount:openshift-cluster-storage-operator:csi-snapshot-controller" cannot list resource "volumesnapshots" in API group "snapshot.storage.k8s.io" at the cluster scope
I0723 14:55:33.128177       1 leaderelection.go:250] attempting to acquire leader lease openshift-cluster-storage-operator/snapshot-controller-leader...

Found in 0.56% of runs (1.14% of failures) across 360 total runs and 1 jobs (48.61% failed) in 6.314s
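
The percentages in the summary line can be reproduced with a little arithmetic. This is a sketch under two assumptions: the two junit results listed above are the only matching runs, and 48.61% of 360 runs corresponds to 175 failed runs. The variable names are illustrative only.

```python
total_runs = 360
failed_runs = 175    # assumption: 48.61% of 360, rounded to whole runs
matching_runs = 2    # assumption: the two junit results listed above

failure_rate = failed_runs / total_runs         # share of runs that failed
match_of_runs = matching_runs / total_runs      # share of all runs that match
match_of_failures = matching_runs / failed_runs # share of failures that match

print(f"failed: {failure_rate:.2%}")
print(f"found in runs: {match_of_runs:.2%}")
print(f"found in failures: {match_of_failures:.2%}")
```

The job header rounds the same figures to whole percentages (49% failed, 1% of failures match, 1% impact).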