canonical/charmed-kubeflow-uats #68 (Closed)
Labels: bug (Something isn't working)
Description
Bug Description
The training-operator UAT started failing after the k8s version was bumped to 1.28 on AKS, with AssertionError: Job pytorch-dist-mnist-gloo was not successful. This happens with both CKF latest/edge and 1.8/stable. Unfortunately, we do not have more detailed logs because of a known limitation in how our UATs run (canonical/charmed-kubeflow-uats#4).
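Because the UAT only reports that the job was not successful, one way to get more detail when reproducing manually is to read the job's status conditions. The following is a minimal sketch, assuming the job object has been dumped to JSON (for example with `kubectl get pytorchjob pytorch-dist-mnist-gloo -o json` in whichever namespace the notebook uses) and that it carries the usual `status.conditions` list with `type`, `status`, `reason` and `message` fields. The helper name `summarize_conditions` and the sample object are hypothetical and are not taken from the failing runs.

```python
import json


def summarize_conditions(job: dict) -> list:
    """Return one readable line per entry in the job's status.conditions."""
    conditions = job.get("status", {}).get("conditions", [])
    return [
        "{type}={status} reason={reason!r} message={message!r}".format(
            type=c.get("type", "?"),
            status=c.get("status", "?"),
            reason=c.get("reason", ""),
            message=c.get("message", ""),
        )
        for c in conditions
    ]


# Hypothetical sample for illustration only; NOT output from the failing run.
SAMPLE_JOB = json.loads("""
{
  "kind": "PyTorchJob",
  "metadata": {"name": "pytorch-dist-mnist-gloo"},
  "status": {
    "conditions": [
      {"type": "Created", "status": "True",
       "reason": "ExampleReason", "message": "illustrative"},
      {"type": "Failed", "status": "True",
       "reason": "ExampleReason", "message": "illustrative"}
    ]
  }
}
""")

for line in summarize_conditions(SAMPLE_JOB):
    print(line)
```

On a real cluster, the `reason` and `message` of the last condition would likely be the first place to look, followed by the logs of the worker pods.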
Example runs
- 1st failed run on AKS k8s 1.28 CKF latest/edge
- 2nd failed run on AKS k8s 1.28 CKF latest/edge
- failed run on AKS k8s 1.28 CKF 1.8/stable
- Successful run on AKS k8s 1.26
To Reproduce
Run CI for k8s version 1.28
Environment
AKS k8s 1.28
Juju 3.1
juju status for 1.8/stable:
Model Controller Cloud/Region Version SLA Timestamp
kubeflow aks-controller aks/westeurope 3.1.8 unsupported 09:19:35Z
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook active 1 admission-webhook 1.8/stable 301 10.0.245.250 no
argo-controller active 1 argo-controller 3.3.10/stable 424 10.0.249.157 no
dex-auth active 1 dex-auth 2.36/stable 422 10.0.185.107 no
envoy res:oci-image@cc06b3e active 1 envoy 2.0/stable 101 10.0.244.49 no
istio-ingressgateway active 1 istio-gateway 1.17/stable 723 10.0.216.118 no
istio-pilot active 1 istio-pilot 1.17/stable 827 10.0.173.92 no
jupyter-controller active 1 jupyter-controller 1.8/stable 849 10.0.75.253 no
jupyter-ui active 1 jupyter-ui 1.8/stable 858 10.0.184.139 no
katib-controller res:oci-image@b6a6100 active 1 katib-controller 0.16/stable 446 10.0.106.5 no
katib-db 8.0.35-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 127 10.0.233.45 no
katib-db-manager active 1 katib-db-manager 0.16/stable 411 10.0.188.36 no
katib-ui active 1 katib-ui 0.16/stable 422 10.0.126.70 no
kfp-api active 1 kfp-api 2.0/stable 1035 10.0.86.37 no
kfp-db 8.0.35-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/stable 127 10.0.57.119 no
kfp-metadata-writer active 1 kfp-metadata-writer 2.0/stable 118 10.0.61.100 no
kfp-persistence active 1 kfp-persistence 2.0/stable 1039 10.0.131.226 no
kfp-profile-controller active 1 kfp-profile-controller 2.0/stable 998 10.0.184.246 no
kfp-schedwf active 1 kfp-schedwf 2.0/stable 1052 10.0.234.76 no
kfp-ui active 1 kfp-ui 2.0/stable 1034 10.0.225.138 no
kfp-viewer active 1 kfp-viewer 2.0/stable 1064 10.0.229.253 no
kfp-viz active 1 kfp-viz 2.0/stable 985 10.0.134.29 no
knative-eventing active 1 knative-eventing 1.10/stable 353 10.0.44.250 no
knative-operator active 1 knative-operator 1.10/stable 328 10.0.68.158 no
knative-serving active 1 knative-serving 1.10/stable 354 10.0.61.216 no
kserve-controller active 1 kserve-controller 0.11/stable 523 10.0.11.66 no
kubeflow-dashboard active 1 kubeflow-dashboard 1.8/stable 454 10.0.147.5 no
kubeflow-profiles active 1 kubeflow-profiles 1.8/stable 355 10.0.68.5 no
kubeflow-roles active 1 kubeflow-roles 1.8/stable 187 10.0.196.222 no
kubeflow-volumes res:oci-image@2261827 active 1 kubeflow-volumes 1.8/stable 260 10.0.29.7 no
metacontroller-operator active 1 metacontroller-operator 3.0/stable 252 10.0.66.178 no
minio res:oci-image@1755999 active 1 minio ckf-1.8/stable 278 10.0.247.208 no
mlmd res:oci-image@44abc5d active 1 mlmd 1.14/stable 127 10.0.219.231 no
oidc-gatekeeper active 1 oidc-gatekeeper ckf-1.8/stable 350 10.0.38.12 no
pvcviewer-operator active 1 pvcviewer-operator 1.8/stable 30 10.0.238.124 no
seldon-controller-manager active 1 seldon-core 1.17/stable 664 10.0.22.127 no
tensorboard-controller active 1 tensorboard-controller 1.8/stable 257 10.0.44.54 no
tensorboards-web-app active 1 tensorboards-web-app 1.8/stable 245 10.0.204.180 no
training-operator active 1 training-operator 1.7/stable 347 10.0.91.235 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.244.0.10
argo-controller/0* active idle 10.244.1.6
dex-auth/0* active idle 10.244.0.12
envoy/0* active idle 10.244.1.34 9090,9901/TCP
istio-ingressgateway/0* active idle 10.244.1.7
istio-pilot/0* active idle 10.244.0.13
jupyter-controller/0* active idle 10.244.0.14
jupyter-ui/0* active idle 10.244.0.15
katib-controller/0* active idle 10.244.0.34 443,8080/TCP
katib-db-manager/0* active idle 10.244.1.10
katib-db/0* active idle 10.244.0.16 Primary
katib-ui/0* active idle 10.244.1.11
kfp-api/0* active idle 10.244.1.12
kfp-db/0* active idle 10.244.1.13 Primary
kfp-metadata-writer/0* active idle 10.244.0.18
kfp-persistence/0* active idle 10.244.0.20
kfp-profile-controller/0* active idle 10.244.0.22
kfp-schedwf/0* active idle 10.244.0.23
kfp-ui/0* active idle 10.244.0.24
kfp-viewer/0* active idle 10.244.1.15
kfp-viz/0* active idle 10.244.0.26
knative-eventing/0* active idle 10.244.0.17
knative-operator/0* active idle 10.244.0.28
knative-serving/0* active idle 10.244.0.21
kserve-controller/0* active idle 10.244.1.18
kubeflow-dashboard/0* active idle 10.244.0.27
kubeflow-profiles/0* active idle 10.244.1.19
kubeflow-roles/0* active idle 10.244.1.14
kubeflow-volumes/0* active idle 10.244.1.21 5000/TCP
metacontroller-operator/0* active idle 10.244.0.25
minio/0* active idle 10.244.0.35 9000-9001/TCP
mlmd/0* active idle 10.244.1.35 8080/TCP
oidc-gatekeeper/0* active idle 10.244.1.16
pvcviewer-operator/0* active idle 10.244.1.20
seldon-controller-manager/0* active idle 10.244.1.17
tensorboard-controller/0* active idle 10.244.0.30
tensorboards-web-app/0* active idle 10.244.0.31
training-operator/0* active idle 10.244.0.32
juju status for latest/edge:
Model Controller Cloud/Region Version SLA Timestamp
kubeflow aks-controller aks/westeurope 3.1.8 unsupported 09:26:07Z
App Version Status Scale Charm Channel Rev Address Exposed Message
admission-webhook active 1 admission-webhook latest/edge 308 10.0.16.94 no
argo-controller active 1 argo-controller latest/edge 468 10.0.100.236 no
dex-auth active 1 dex-auth latest/edge 458 10.0.254.87 no
envoy active 1 envoy latest/edge 183 10.0.245.125 no
istio-ingressgateway active 1 istio-gateway latest/edge 900 10.0.44.117 no
istio-pilot active 1 istio-pilot latest/edge 872 10.0.21.240 no
jupyter-controller active 1 jupyter-controller latest/edge 936 10.0.131.139 no
jupyter-ui active 1 jupyter-ui latest/edge 856 10.0.90.58 no
katib-controller active 1 katib-controller latest/edge 526 10.0.152.253 no
katib-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/edge 138 10.0.152.2 no
katib-db-manager active 1 katib-db-manager latest/edge 490 10.0.236.4 no
katib-ui active 1 katib-ui latest/edge 501 10.0.92.22 no
kfp-api active 1 kfp-api latest/edge 1244 10.0.176.102 no
kfp-db 8.0.36-0ubuntu0.22.04.1 active 1 mysql-k8s 8.0/edge 138 10.0.3.211 no
kfp-metadata-writer active 1 kfp-metadata-writer latest/edge 298 10.0.201.207 no
kfp-persistence active 1 kfp-persistence latest/edge 1251 10.0.6.212 no
kfp-profile-controller active 1 kfp-profile-controller latest/edge 1209 10.0.253.135 no
kfp-schedwf active 1 kfp-schedwf latest/edge 1263 10.0.6.119 no
kfp-ui active 1 kfp-ui latest/edge 1246 10.0.221.196 no
kfp-viewer active 1 kfp-viewer latest/edge 1276 10.0.137.58 no
kfp-viz active 1 kfp-viz latest/edge 1197 10.0.127.237 no
knative-eventing active 1 knative-eventing latest/edge 393 10.0.110.159 no
knative-operator active 1 knative-operator latest/edge 368 10.0.145.205 no
knative-serving active 1 knative-serving latest/edge 394 10.0.147.68 no
kserve-controller active 1 kserve-controller latest/edge 538 10.0.154.87 no
kubeflow-dashboard active 1 kubeflow-dashboard latest/edge 517 10.0.52.98 no
kubeflow-profiles active 1 kubeflow-profiles latest/edge 379 10.0.164.223 no
kubeflow-roles active 1 kubeflow-roles latest/edge 207 10.0.205.101 no
kubeflow-volumes active 1 kubeflow-volumes latest/edge 279 10.0.83.113 no
metacontroller-operator active 1 metacontroller-operator latest/edge 280 10.0.153.8 no
minio res:oci-image@1755999 active 1 minio latest/edge 306 10.0.52.197 no
mlmd active 1 mlmd latest/edge 174 10.0.188.218 no
oidc-gatekeeper active 1 oidc-gatekeeper latest/edge 371 10.0.125.250 no
pvcviewer-operator active 1 pvcviewer-operator latest/edge 74 10.0.97.108 no
seldon-controller-manager active 1 seldon-core latest/edge 691 10.0.87.195 no
tensorboard-controller active 1 tensorboard-controller latest/edge 281 10.0.30.201 no
tensorboards-web-app active 1 tensorboards-web-app latest/edge 269 10.0.24.183 no
training-operator active 1 training-operator latest/edge 378 10.0.16.237 no
Unit Workload Agent Address Ports Message
admission-webhook/0* active idle 10.244.0.7
argo-controller/0* active idle 10.244.1.9
dex-auth/0* active idle 10.244.0.8
envoy/0* active idle 10.244.1.11
istio-ingressgateway/0* active idle 10.244.1.10
istio-pilot/0* active idle 10.244.0.9
jupyter-controller/0* active idle 10.244.1.12
jupyter-ui/0* active idle 10.244.1.14
katib-controller/0* active idle 10.244.1.15
katib-db-manager/0* active idle 10.244.1.16
katib-db/0* active idle 10.244.0.12 Primary
katib-ui/0* active idle 10.244.1.17
kfp-api/0* active idle 10.244.1.18
kfp-db/0* active idle 10.244.1.19 Primary
kfp-metadata-writer/0* active idle 10.244.0.13
kfp-persistence/0* active idle 10.244.0.15
kfp-profile-controller/0* active idle 10.244.0.16
kfp-schedwf/0* active idle 10.244.0.18
kfp-ui/0* active idle 10.244.1.20
kfp-viewer/0* active idle 10.244.0.19
kfp-viz/0* active idle 10.244.0.20
knative-eventing/0* active idle 10.244.0.14
knative-operator/0* active idle 10.244.0.22
knative-serving/0* active idle 10.244.0.17
kserve-controller/0* active idle 10.244.1.25
kubeflow-dashboard/0* active idle 10.244.1.23
kubeflow-profiles/0* active idle 10.244.0.24
kubeflow-roles/0* active idle 10.244.1.21
kubeflow-volumes/0* active idle 10.244.0.21
metacontroller-operator/0* active idle 10.244.1.22
minio/0* active idle 10.244.1.24 9000-9001/TCP
mlmd/0* active idle 10.244.1.28
oidc-gatekeeper/0* active idle 10.244.0.23
pvcviewer-operator/0* active idle 10.244.0.26
seldon-controller-manager/0* active idle 10.244.1.26
tensorboard-controller/0* active idle 10.244.1.27
tensorboards-web-app/0* active idle 10.244.0.25
training-operator/0* active idle 10.244.1.29
Relevant Log Output
test_notebooks.py::test_notebook[training-integration]
-------------------------------- live log call ---------------------------------
INFO test_notebooks:test_notebooks.py:44 Running training-integration.ipynb...
ERROR test_notebooks:test_notebooks.py:58 Cell In[4], line 8, in assert_job_succeeded(client, job_name, job_kind)
1 @retry(
2 wait=wait_exponential(multiplier=2, min=1, max=30),
3 stop=stop_after_attempt(50),
4 reraise=True,
5 )
6 def assert_job_succeeded(client, job_name, job_kind):
7 """Wait for the Job to complete successfully."""
----> 8 assert client.is_job_succeeded(
9 name=job_name, job_kind=job_kind
10 ), f"Job {job_name} was not successful."
AssertionError: Job pytorch-dist-mnist-gloo was not successful.
FAILED [100%]
=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________
test_notebook = '/tests/.worktrees/4ca5f8e7474193b125daecbd2dc157f3fe1ab017/tests/notebooks/katib/katib-integration.ipynb'
@pytest.mark.ipynb
@pytest.mark.parametrize(
# notebook - ipynb file to execute
"test_notebook",
NOTEBOOKS.values(),
ids=NOTEBOOKS.keys(),
)
def test_notebook(test_notebook):
"""Test Notebook Generic Wrapper."""
os.chdir(os.path.dirname(test_notebook))
with open(test_notebook) as nb:
notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
ep = ExecutePreprocessor(
timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
)
ep.skip_cells_with_tag = "pytest-skip"
try:
log.info(f"Running {os.path.basename(test_notebook)}...")
output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
# persist the notebook output to the original file for debugging purposes
save_notebook(output_notebook, test_notebook)
except CellExecutionError as e:
# handle underlying error
pytest.fail(f"Notebook execution failed with {e.ename}: {e.evalue}")
for cell in output_notebook.cells:
metadata = cell.get("metadata", dict)
if "raises-exception" in metadata.get("tags", []):
for cell_output in cell.outputs:
if cell_output.output_type == "error":
# extract the error message from the cell output
log.error(format_error_message(cell_output.traceback))
> pytest.fail(cell_output.traceback[-1])
E Failed: AssertionError: Katib Experiment was not successful.
Additional Context
No response
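For reference, the retry policy visible in the training-integration traceback (`wait_exponential(multiplier=2, min=1, max=30)` with `stop_after_attempt(50)`) bounds how long the notebook waits for the job before the assertion is raised. The sketch below estimates that budget without importing tenacity. It assumes the delay after the n-th failed attempt is `multiplier * 2**(n - 1)` clamped to `[min, max]`, which is how recent tenacity releases appear to behave; older releases may differ slightly in the first few delays. The function name `backoff_delays` is hypothetical, and the time spent inside each `is_job_succeeded` call is not counted.

```python
def backoff_delays(multiplier, min_wait, max_wait, attempts):
    """Delays (seconds) slept between `attempts` tries of a clamped exponential backoff."""
    # Assumption: the delay after the n-th failed attempt is
    # multiplier * 2**(n - 1), clamped to [min_wait, max_wait].
    # No sleep follows the final attempt, hence attempts - 1 delays.
    return [
        max(min_wait, min(multiplier * 2 ** (n - 1), max_wait))
        for n in range(1, attempts)
    ]


delays = backoff_delays(multiplier=2, min_wait=1, max_wait=30, attempts=50)
total = sum(delays)
print(f"{len(delays)} waits, {total} s total (~{total / 60:.0f} min)")
# prints "49 waits, 1380 s total (~23 min)"
```

Under these assumptions, the job has roughly 23 minutes to reach a succeeded state, so the failure could be either a job that failed outright or one that was still pending or running when the retries ran out.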