# End-to-end Testing
**Authors**: Dominik Fleischmann ([@domFleischmann](https://github.com/domFleischmann)), Kimonas
Sotirchos ([@kimwnasptd](https://github.com/kimwnasptd)), and Anna Jung ([@annajung](https://github.com/annajung))
## Background
Previously, the Kubeflow community leveraged prow optional-test-infra for e2e testing with credit from AWS. After the
optional test infrastructure deprecation notice, all WGs moved their tests to GitHub Actions as a temporary solution. Due
to resource constraints of GitHub-hosted runners, the Kubeflow community stopped supporting e2e tests as part of the
migration. In partnership with Amazon, a new AWS account has been created with sponsored credits. With the new AWS
account, the Kubeflow community is no longer limited by resource constraints posed by GitHub Actions. To enable the e2e
test for the Manifest repo, this doc proposes a design to set up the infrastructure needed to run the necessary tests.
References:
- [Optional Test Infra Deprecation Notice](https://github.com/kubeflow/testing/issues/993)
- [Alternative solution to removal of test on optional-test-infra](https://github.com/kubeflow/testing/issues/1006)
## Goal
Enable e2e testing for the Manifest repo and leverage it to shorten the manifest testing phase of the Kubeflow
release cycle and to increase the quality of the Kubeflow release by ensuring Kubeflow components and dependencies work
correctly together.
## Proposal
After some initial conversations, it has been agreed to create integration tests based on GitHub Actions, which will
spawn an EC2 instance with enough resources to deploy the complete Kubeflow solution and run some end-to-end testing.
## Implementation
Below are the steps the GitHub Actions workflow will perform to complete end-to-end testing:
- [Create credentials required by AWS](#create-credentials-required-by-aws)
- [Create an EC2 instance](#create-an-ec2-instance)
- [Install Kubernetes on the instance](#install-kubernetes-on-the-instance)
- [Deploy Kubeflow](#deploy-kubeflow)
- [Run tests](#run-tests)
- [Log and report errors](#log-and-report-errors)
- [Clean up](#clean-up)
### Create credentials required by AWS
To leverage AWS, two credentials are required:
- `AWS_ACCESS_KEY_ID`: Specifies an AWS access key associated with an IAM user or role.
- `AWS_SECRET_ACCESS_KEY`: Specifies the secret key associated with the access key. This is essentially the "password"
for the access key.
Both credentials need to
be [stored as secrets on GitHub](https://docs.github.com/en/actions/security-guides/encrypted-secrets)
and will be accessed in a workflow as environment variables:
```yaml
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```
### Create an EC2 instance
Access the AWS credentials (stored as GitHub Secrets) and create an EC2 instance.
Using [juju](https://juju.is/) as an orchestration tool, configure the AWS credentials and deploy an EC2 instance with
the following configuration (a minimal sketch follows the list):
- Image: Ubuntu Server (latest)
- Type: t3a.xlarge
- Root disk: 80G
- Region: us-east-1 (default)
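A minimal sketch of what this could look like with juju (the controller name and machine ID are illustrative, and the
exact flags may differ in the final workflow):
```shell
# Register the AWS credential for juju (reads the AWS_* values; in CI this would be done non-interactively)
juju add-credential aws
# Bootstrap a controller in the default region
juju bootstrap aws/us-east-1 kf-e2e-controller
# Provision the test machine with the proposed instance size and root disk (Ubuntu is the juju default image)
juju add-machine --constraints "instance-type=t3a.xlarge root-disk=80G"
# Interact with the machine once it is provisioned (machine 0 is illustrative)
juju ssh 0 -- "cloud-init status --wait"
```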
#### Why juju?
Juju allows easy configuration of various cloud providers. If a reason to shift to another infrastructure provider
arises in the future, it would allow us to pivot quickly.
While juju provides far more capability, the proposal is to use the tool only for configuration management and as a
medium to deploy and connect to EC2 instances.
**Note**: Using GitHub Secrets to store AWS credentials will not allow any forked repositories to access the secrets.
### Install Kubernetes on the Instance
Install Kubernetes on the EC2 instance where Kubeflow will be deployed and tested.
To install Kubernetes, we explored two options and propose to use **KinD**:
- [Microk8s](#microk8s)
- [KinD](#kind)
#### KinD
Using KinD, install Kubernetes with the existing KinD configuration managed by the Manifest WG.
```shell
# Install dependencies - docker
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common tar
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo systemctl status docker
sudo usermod -a -G docker ubuntu
# Install dependencies - kubectl
sudo curl -L "https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl" -o /usr/local/bin/kubectl
sudo chmod +x /usr/local/bin/kubectl
kubectl version --short --client
# Install KinD
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.17.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Deploy kubernetes using KinD
cd manifests
kind create cluster --config ./tests/gh-actions/kind-cluster.yaml
```
##### Why KinD?
While many tools can be leveraged to deploy Kubernetes, Manifest WG already leverages KinD to run both core and contrib
component tests. By reusing the tool, we can leverage the existing KinD configuration and keep the similarity between
component and e2e testing.
**Note**: KinD is a subproject of Kubernetes but does not automatically release with a new Kubernetes version and does
not follow the Kubernetes release cadence. More details can be found at
[kind/issue#197](https://github.com/kubernetes-sigs/kind/issues/197).
### Deploy Kubeflow
Deploy Kubeflow in the same manner the Manifests WG documents.
Copy the manifests repo to the AWS instance and use Kustomize to run the Kubeflow installation. After the Kustomize
installation is complete, verify all pods are running.
The documented manifests installation retries in a while loop and can therefore loop indefinitely; a time limit of 45
minutes should be set to ensure the installation exits when a problem occurs with the Kubeflow installation.
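A minimal sketch of the bounded installation, assuming the single-command, retry-loop install documented in the
manifests repo (the namespaces checked below are illustrative):
```shell
cd manifests
# Bound the documented retry-loop install to 45 minutes so a broken installation fails the job
timeout 45m bash -c \
  'while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done'
# Verify all pods are running before executing tests
kubectl wait --for=condition=Ready pods --all --timeout=600s -n kubeflow
kubectl wait --for=condition=Ready pods --all --timeout=600s -n kubeflow-user-example-com
```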
### Run Tests
Execute integration tests to verify the correct functioning of different features using python scripts and jupyter
notebooks.
As the first iteration, test the Kubeflow integration using the
existing [e2e mnist python script](https://github.com/kubeflow/manifests/tree/master/tests/e2e)
and [e2e mnist notebook](https://github.com/kubeflow/pipelines/blob/master/samples/experimental/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb).
- [Python script](#python-script)
- [Jupyter notebook](#jupyter-notebook)
Both the python script and the notebook test the following:
- KFP and Katib SDK packages (compatibility with other python packages)
- Creation and execution of a pipeline from a user namespace
- Creation and execution of hyperparameter tuning with Katib from a user namespace
- Creation and execution of distributed training with TFJob from a user namespace
- Creation and execution of inference using KServe from a user namespace
**Note**: The mnist notebook does not test the Kubeflow Notebook resources. In the future, additional verification and
tests should be added to cover various Kubeflow components and features.
#### Python script
Steps to run the e2e python script from the workflow (a sketch follows the list):
1. Convert e2e mnist notebook to a python script (
reuse [mnist.py](https://github.com/kubeflow/manifests/blob/master/tests/e2e/mnist.py))
2. Run mnist python script outside of the cluster (
reuse [runner.sh](https://github.com/kubeflow/manifests/blob/master/tests/e2e/runner.sh))
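A minimal sketch of these two steps (the exact arguments of `runner.sh` may differ; see the script in the repo):
```shell
# 1. Regenerate the python script from the notebook (the repo already ships mnist.py)
jupyter nbconvert --to script kubeflow-e2e-mnist.ipynb --output mnist
# 2. Run the mnist script from outside the cluster via the existing runner
cd manifests/tests/e2e && ./runner.sh
```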
#### Jupyter notebook
Steps to run the e2e notebook from the workflow:
1. Get the e2e mnist notebook
   1. To run the existing e2e mnist notebook, a modification needs to be made in the last step to wait for the
      triggered run to finish before executing the inference. The proposed changes are shown below, and a pull request
      will need to be made in the future to avoid copying the mnist notebook into the manifests directory.
```python
import numpy as np
import time
from PIL import Image
import requests

# Pipeline Run should be succeeded.
run_status = kfp_client.get_run(run_id=run_id).run.status
if run_status is None:
    print("Waiting for the Run {} to start".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

while run_status == "Running":
    print("Run {} is in progress".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

if run_status == "Succeeded":
    print("Run {} has Succeeded\n".format(run_id))

    # Specify the image URL here.
    image_url = "https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/kubeflow-pipelines/images/9.bmp"
    image = Image.open(requests.get(image_url, stream=True).raw)
    data = np.array(image.convert('L').resize((28, 28))).astype(float).reshape(-1, 28, 28, 1)
    data_formatted = np.array2string(data, separator=",", formatter={"float": lambda x: "%.1f" % x})
    json_request = '{{ "instances" : {} }}'.format(data_formatted)

    # Specify the prediction URL. If you are running this notebook outside of the Kubernetes cluster, you should set the Cluster IP.
    url = "http://{}-predictor-default.{}.svc.cluster.local/v1/models/{}:predict".format(name, namespace, name)
    time.sleep(60)
    response = requests.post(url, data=json_request)
    print("Prediction for the image")
    display(image)
    print(response.json())
else:
    raise Exception("Run {} failed with status {}\n".format(run_id, kfp_client.get_run(run_id=run_id).run.status))
```
2. Move the mnist notebook into the cluster
```shell
kubectl -n kubeflow-user-example-com create configmap <configmap name> --from-file kubeflow-e2e-mnist.ipynb
```
3. Create a PodDefault to allow access to Kubeflow pipelines
```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: access-ml-pipeline
  namespace: kubeflow-user-example-com
spec:
  desc: Allow access to Kubeflow Pipelines
  selector:
    matchLabels:
      access-ml-pipeline: "true"
  env:
    - ## this environment variable is automatically read by `kfp.Client()`
      ## this is the default value, but we show it here for clarity
      name: KF_PIPELINES_SA_TOKEN_PATH
      value: /var/run/secrets/kubeflow/pipelines/token
  volumes:
    - name: volume-kf-pipeline-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 7200
              ## defined by the `TOKEN_REVIEW_AUDIENCE` environment variable on the `ml-pipeline` deployment
              audience: pipelines.kubeflow.org
  volumeMounts:
    - mountPath: /var/run/secrets/kubeflow/pipelines
      name: volume-kf-pipeline-token
      readOnly: true
```
4. Run the notebook programmatically using a Kubernetes Job or a Kubeflow Notebook resource
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test-notebook-job
  namespace: kubeflow-user-example-com
spec:
  backoffLimit: 1
  activeDeadlineSeconds: 1200
  template:
    metadata:
      labels:
        access-ml-pipeline: "true"
    spec:
      restartPolicy: Never
      initContainers:
        - name: copy-notebook
          image: busybox
          command: ['sh', '-c', 'cp /scripts/* /etc/kubeflow-e2e/']
          volumeMounts:
            - name: e2e-test
              mountPath: /scripts
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      containers:
        - image: kubeflownotebookswg/jupyter-scipy:v1.6.1
          imagePullPolicy: IfNotPresent
          name: execute-notebook
          command:
            - /bin/sh
            - -c
            - |
              jupyter nbconvert --to notebook --execute /etc/kubeflow-e2e/kubeflow-e2e-mnist.ipynb;
              x=$(echo $?); curl -fsI -X POST http://localhost:15020/quitquitquit && exit $x;
          volumeMounts:
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      serviceAccountName: default-editor
      volumes:
        - name: e2e-test
          configMap:
            name: e2e-test
        - name: kubeflow-e2e
          emptyDir: {}
```
5. Verify Job succeeded or failed
```shell
kubectl -n kubeflow-user-example-com wait --for=condition=complete --timeout=1200s job/test-notebook-job
```
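If the wait fails or times out, the Job's pod output can be surfaced for the log collection described in the next
section (the Job name matches the manifest above):
```shell
# Dump the executed notebook's output and the Job status for debugging
kubectl -n kubeflow-user-example-com logs job/test-notebook-job
kubectl -n kubeflow-user-example-com describe job/test-notebook-job
```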
### Log and Report Errors
Report logs generated on the EC2 instance back to GitHub Actions for users.
For failures in the workflow steps, generate inspect, pod, and describe logs. Copy the generated logs back to
the GitHub Actions runner and use [actions/upload-artifact@v2](https://github.com/actions/upload-artifact)
to allow users to access the logs when necessary.
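A minimal sketch of the collection step (the machine ID, log paths, and exact commands are illustrative; the upload
itself is handled by `actions/upload-artifact`):
```shell
# Gather diagnostics on the EC2 instance, then copy them back to the runner for upload as artifacts
juju ssh 0 -- 'mkdir -p ~/logs && kubectl get pods -A > ~/logs/pods.txt && kubectl describe pods -A > ~/logs/describe.txt'
juju scp -- -r 0:logs ./logs
```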
**Note**: By default, artifacts are retained for 90 days. The number of retention days is configurable.
### Clean Up
Regardless of the workflow's success or failure, the EC2 instance is deleted at the end of the workflow to ensure
there are no resources left behind.
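A minimal sketch of the teardown, run in a step with `if: always()` so it executes even after failures (the controller
name is illustrative):
```shell
# Destroy the juju controller and every model/machine it manages, including the EC2 instance
juju destroy-controller --destroy-all-models -y kf-e2e-controller
```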
## Debugging
To debug any failed step of the GitHub Actions
workflow, [debugging with ssh](https://github.com/marketplace/actions/debugging-with-ssh)
or other similar tools can be used to ssh into the GitHub-hosted runner. From the runner, juju can be used to connect
to the AWS EC2 instance.
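For example, once on the runner (machine ID illustrative):
```shell
# Inspect the juju-managed machine and open a shell on it
juju status
juju ssh 0
```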
**Notes**:
- GitHub secrets are limited to the Manifest repo and do not cascade to forked repositories. To debug, users must set up
their own AWS secrets.
- To debug the AWS EC2 instance without ssh-ing into the GitHub-hosted runner, you must have access to the AWS
  credentials. Access to the AWS credentials is limited to [Manifest WG approvers](https://github.com/kubeflow/manifests/blob/master/OWNERS).
## Proof of Concept Workflow
The POC code
is [available](https://github.com/DomFleischmann/manifests/blob/aj-dev/.github/workflows/aws_e2e_tests.yaml)
with examples of both [successful](https://github.com/DomFleischmann/manifests/actions/runs/4118561167/jobs/7111228604)
and [failed](https://github.com/DomFleischmann/manifests/actions/runs/4119052861) runs.
The proposed end-to-end workflow has been tested with the following Kubernetes and Kubeflow versions:
- 1.22 Kubernetes and [1.6.1 Kubeflow release](https://github.com/kubeflow/manifests/releases/tag/v1.6.1) (microk8s)
- 1.24 Kubernetes and main branch of the manifests repo ([last commit](https://github.com/DomFleischmann/manifests/commit/8e5714171f1fd5b00f59f436e9ab8cb45a0f30e3)) (microk8s)
- 1.25 Kubernetes and main branch of the manifests repo ([last commit](https://github.com/DomFleischmann/manifests/commit/8e5714171f1fd5b00f59f436e9ab8cb45a0f30e3)) (kind)
### Alternative solutions considered
#### Prow
While some Prow-based tests already exist, they were discarded because they had not been updated in two years and
carried a high amount of complexity. After some investigation, the Manifests Working Group decided that adapting those
tests to the current state of the manifests would be more work than starting from scratch with lower complexity.
#### Self-hosted runners
Self-hosted runners are not recommended for public repositories due to security concerns with how they behave on pull
requests made from forked repositories.
#### MicroK8s
Instead of KinD, [microk8s](https://microk8s.io/) was considered as an alternative to install Kubernetes.
Below are the steps required in the workflow to install microk8s and to deploy Kubernetes using microk8s. During
the Kubernetes installation, you must enable [dns](https://microk8s.io/docs/addon-dns),
[storage](https://microk8s.io/docs/addon-hostpath-storage), [ingress](https://microk8s.io/docs/addon-ingress),
[loadbalancer](https://microk8s.io/docs/addon-metallb), and [rbac](https://microk8s.io/docs/multi-user).
```shell
# Install microk8s
sudo snap install microk8s --classic --channel ${{ matrix.microk8s }}
sudo apt update
sudo usermod -a -G microk8s ubuntu
# Install dependencies - kubectl
sudo snap alias microk8s.kubectl kubectl
# Deploy kubernetes using microk8s (already installed above) by enabling the required addons
microk8s enable dns hostpath-storage ingress metallb:10.64.140.43-10.64.140.49 rbac
```
**Note**: microk8s requires an IP address pool when enabling metallb; the address pool 10.64.140.43-10.64.140.49 is an
arbitrary choice.