# End-to-end Testing

**Authors**: Dominik Fleischmann ([@domFleischmann](https://github.com/domFleischmann)), Kimonas Sotirchos ([@kimwnasptd](https://github.com/kimwnasptd)), and Anna Jung ([@annajung](https://github.com/annajung))

## Background

Previously, the Kubeflow community leveraged the Prow-based optional-test-infra for e2e testing, with credits from AWS. After the optional test infrastructure deprecation notice, all WGs moved their tests to GitHub Actions as a temporary solution. Due to the resource constraints of GitHub-hosted runners, the Kubeflow community stopped supporting e2e tests as part of the migration. In partnership with Amazon, a new AWS account has been created with sponsored credits. With the new AWS account, the Kubeflow community is no longer limited by the resource constraints posed by GitHub Actions. To enable e2e tests for the Manifests repo, this doc proposes a design to set up the infrastructure needed to run the necessary tests.

References

- [Optional Test Infra Deprecation Notice](https://github.com/kubeflow/testing/issues/993)
- [Alternative solution to removal of test on optional-test-infra](https://github.com/kubeflow/testing/issues/1006)

## Goal

Enable e2e testing for the Manifests repo and leverage it to shorten the manifests testing phase of the Kubeflow release cycle and to increase the quality of Kubeflow releases by ensuring that Kubeflow components and dependencies work correctly together.

## Proposal

After some initial conversations, it has been agreed to create integration tests based on GitHub Actions, which will spawn an EC2 instance with enough resources to deploy the complete Kubeflow solution and run end-to-end tests.

## Implementation

Below are the steps the GitHub Actions workflow performs to complete end-to-end testing:

- [Create credentials required by AWS](#create-credentials-required-by-aws)
- [Create an EC2 instance](#create-an-ec2-instance)
- [Install Kubernetes on the instance](#install-kubernetes-on-the-instance)
- [Deploy Kubeflow](#deploy-kubeflow)
- [Run tests](#run-tests)
- [Log and report errors](#log-and-report-errors)
- [Clean up](#clean-up)

### Create credentials required by AWS

To leverage AWS, two credentials are required:

- `AWS_ACCESS_KEY_ID`: Specifies an AWS access key associated with an IAM user or role.
- `AWS_SECRET_ACCESS_KEY`: Specifies the secret key associated with the access key. This is essentially the "password" for the access key.

Both credentials need to be [stored as secrets on GitHub](https://docs.github.com/en/actions/security-guides/encrypted-secrets) and will be accessed in a workflow as environment variables.

```yaml
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```

### Create an EC2 instance

Access the AWS credentials (stored as GitHub Secrets) and create an EC2 instance.

Using [juju](https://juju.is/) as an orchestration tool, configure the AWS credentials and deploy an EC2 instance with the following configuration (a sketch of the juju commands is shown after the list):

- Image: Ubuntu Server (latest)
- Type: t3a.xlarge
- Root disk: 80G
- Region: us-east-1 (default)

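The exact commands are not prescribed by this proposal; the following is a minimal sketch of registering the credentials with juju and bootstrapping an instance of the proposed size. The credential name `e2e-test` and controller name `kf-e2e` are illustrative assumptions.

```shell
# Sketch only: register the AWS credentials with juju from the workflow environment.
# The credential name "e2e-test" and controller name "kf-e2e" are assumptions.
cat > aws-credentials.yaml <<EOF
credentials:
  aws:
    e2e-test:
      auth-type: access-key
      access-key: ${AWS_ACCESS_KEY_ID}
      secret-key: ${AWS_SECRET_ACCESS_KEY}
EOF
juju add-credential aws -f aws-credentials.yaml --client

# Bootstrap a controller on an EC2 instance matching the proposed configuration
juju bootstrap aws/us-east-1 kf-e2e \
  --bootstrap-constraints "instance-type=t3a.xlarge root-disk=80G"
```
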
#### Why juju?

Juju makes it easy to configure various cloud providers. In the future, if there is a reason to shift to another infrastructure provider, it would allow us to pivot quickly.

While juju provides more capability, the proposal is to use the tool only for configuration management and as a medium to deploy and connect to EC2 instances.

**Note**: Storing AWS credentials as GitHub Secrets means forked repositories cannot access the secrets.

### Install Kubernetes on the Instance

Install Kubernetes on the EC2 instance where Kubeflow will be deployed and tested.

To install Kubernetes, we explored two options and propose to use **KinD**:

- [MicroK8s](#microk8s)
- [KinD](#kind)

#### KinD

Using KinD, install Kubernetes with the existing KinD configuration managed by the Manifests WG.

```shell
# Install dependencies - docker
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common tar
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo systemctl status docker
sudo usermod -a -G docker ubuntu

# Install dependencies - kubectl
sudo curl -L "https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl" -o /usr/local/bin/kubectl
sudo chmod +x /usr/local/bin/kubectl
kubectl version --short --client

# Install KinD
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.17.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Deploy kubernetes using KinD
cd manifests
kind create cluster --config ./tests/gh-actions/kind-cluster.yaml
```

##### Why KinD?

While many tools could be used to deploy Kubernetes, the Manifests WG already leverages KinD to run both core and contrib component tests. By reusing the tool, we can leverage the existing KinD configuration and keep e2e testing consistent with component testing.

**Note**: KinD is a subproject of Kubernetes but does not automatically release with each new Kubernetes version and does not follow the Kubernetes release cadence. More details can be found in [kind/issue#197](https://github.com/kubernetes-sigs/kind/issues/197).

### Deploy Kubeflow

Deploy Kubeflow in the same manner the Manifests WG documents.

Copy the manifests repo to the AWS instance and use Kustomize to run the Kubeflow installation. After the Kustomize installation is complete, verify that all pods are running.

The manifests installation may result in an infinite while loop; therefore, a time limit of 45 minutes should be set so the installation exits when a problem occurs with the Kubeflow installation. A sketch of such an installation step is shown below.

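A minimal sketch of the installation step, assuming the single-command install from the manifests README and the `example` kustomization; the wait conditions and timeouts below are assumptions, not confirmed workflow details.

```shell
# Sketch only: install Kubeflow from the copied manifests repo with a 45-minute cap.
# Assumes kustomize and kubectl are already installed on the instance.
cd manifests
timeout 45m bash -c '
  while ! kustomize build example | kubectl apply -f -; do
    echo "Retrying to apply resources"
    sleep 10
  done
'

# Verify that all pods come up before running the tests
kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout=600s
```
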
### Run Tests

Execute integration tests to verify the correct functioning of different features using Python scripts and Jupyter notebooks.

As the first iteration, test the Kubeflow integration using the existing [e2e mnist python script](https://github.com/kubeflow/manifests/tree/master/tests/e2e) and [e2e mnist notebook](https://github.com/kubeflow/pipelines/blob/master/samples/experimental/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb).

- [Python script](#python-script)
- [Jupyter notebook](#jupyter-notebook)

Both the Python script and the notebook test the following:

- KFP and Katib SDK packages (compatibility with other Python packages)
- Creation and execution of a pipeline from a user namespace
- Creation and execution of hyperparameter tuning with Katib from a user namespace
- Creation and execution of distributed training with TFJob from a user namespace
- Creation and execution of inference using KServe from a user namespace

**Note**: The mnist notebook does not test the Kubeflow Notebook resources. In the future, additional verification and tests should be added to cover various Kubeflow components and features.

#### Python script

Steps to run the e2e Python script from the workflow (a sketch of a possible invocation is shown after the list):

1. Convert the e2e mnist notebook to a Python script (reuse [mnist.py](https://github.com/kubeflow/manifests/blob/master/tests/e2e/mnist.py))
2. Run the mnist Python script outside of the cluster (reuse [runner.sh](https://github.com/kubeflow/manifests/blob/master/tests/e2e/runner.sh))

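A minimal sketch of how the workflow might invoke the existing script from the EC2 instance; the working directory and the dependency-installation step are assumptions.

```shell
# Sketch only: run the existing e2e test driver from the cloned manifests repo.
# The requirements file and its location are assumptions.
cd manifests/tests/e2e
pip install -r requirements.txt
./runner.sh
```
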
#### Jupyter notebook

Steps to run the e2e notebook from the workflow:

1. Get the e2e mnist notebook
   1. To run the existing e2e mnist notebook, the last step needs to be modified to wait for the triggered run to finish before executing the prediction. The proposed changes are shown below; a pull request will need to be made in the future to avoid copying the mnist notebook into the manifests directory.

```python
import numpy as np
import time
from PIL import Image
import requests

# The pipeline run should succeed.
run_status = kfp_client.get_run(run_id=run_id).run.status

if run_status is None:
    print("Waiting for the Run {} to start".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

while run_status == "Running":
    print("Run {} is in progress".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

if run_status == "Succeeded":
    print("Run {} has Succeeded\n".format(run_id))

    # Specify the image URL here.
    image_url = "https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/kubeflow-pipelines/images/9.bmp"
    image = Image.open(requests.get(image_url, stream=True).raw)
    data = np.array(image.convert('L').resize((28, 28))).astype(float).reshape(-1, 28, 28, 1)
    data_formatted = np.array2string(data, separator=",", formatter={"float": lambda x: "%.1f" % x})
    json_request = '{{ "instances" : {} }}'.format(data_formatted)

    # Specify the prediction URL. If you are running this notebook outside of the Kubernetes cluster, set the Cluster IP instead.
    url = "http://{}-predictor-default.{}.svc.cluster.local/v1/models/{}:predict".format(name, namespace, name)

    time.sleep(60)
    response = requests.post(url, data=json_request)

    print("Prediction for the image")
    display(image)
    print(response.json())
else:
    raise Exception("Run {} failed with status {}\n".format(run_id, kfp_client.get_run(run_id=run_id).run.status))
```

2. Move the mnist notebook into the cluster (the Job in step 4 assumes the ConfigMap is named `e2e-test`)
```shell
kubectl -n kubeflow-user-example-com create configmap <configmap name> --from-file kubeflow-e2e-mnist.ipynb
```

3. Create a PodDefault to allow access to Kubeflow pipelines
```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: access-ml-pipeline
  namespace: kubeflow-user-example-com
spec:
  desc: Allow access to Kubeflow Pipelines
  selector:
    matchLabels:
      access-ml-pipeline: "true"
  env:
    - ## this environment variable is automatically read by `kfp.Client()`
      ## this is the default value, but we show it here for clarity
      name: KF_PIPELINES_SA_TOKEN_PATH
      value: /var/run/secrets/kubeflow/pipelines/token
  volumes:
    - name: volume-kf-pipeline-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 7200
              ## defined by the `TOKEN_REVIEW_AUDIENCE` environment variable on the `ml-pipeline` deployment
              audience: pipelines.kubeflow.org
  volumeMounts:
    - mountPath: /var/run/secrets/kubeflow/pipelines
      name: volume-kf-pipeline-token
      readOnly: true
```

4. Run the notebook programmatically using a Kubernetes Job or a Kubeflow Notebook resource
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test-notebook-job
  namespace: kubeflow-user-example-com
spec:
  backoffLimit: 1
  activeDeadlineSeconds: 1200
  template:
    metadata:
      labels:
        access-ml-pipeline: "true"
    spec:
      restartPolicy: Never
      initContainers:
        - name: copy-notebook
          image: busybox
          command: ['sh', '-c', 'cp /scripts/* /etc/kubeflow-e2e/']
          volumeMounts:
            - name: e2e-test
              mountPath: /scripts
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      containers:
        - image: kubeflownotebookswg/jupyter-scipy:v1.6.1
          imagePullPolicy: IfNotPresent
          name: execute-notebook
          command:
            - /bin/sh
            - -c
            - |
              jupyter nbconvert --to notebook --execute /etc/kubeflow-e2e/kubeflow-e2e-mnist.ipynb;
              x=$(echo $?); curl -fsI -X POST http://localhost:15020/quitquitquit && exit $x;
          volumeMounts:
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      serviceAccountName: default-editor
      volumes:
        - name: e2e-test
          configMap:
            name: e2e-test
        - name: kubeflow-e2e
          emptyDir: {}
```

5. Verify Job succeeded or failed
```shell
kubectl -n kubeflow-user-example-com wait --for=condition=complete --timeout=1200s job/test-notebook-job
```

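The command above covers only the success path. A minimal sketch of how the workflow could also surface the failure case and the notebook logs (not part of the POC):

```shell
# Sketch only: fail the workflow step and dump the notebook pod logs if the Job does not complete.
kubectl -n kubeflow-user-example-com wait --for=condition=complete --timeout=1200s job/test-notebook-job || {
  kubectl -n kubeflow-user-example-com logs job/test-notebook-job --all-containers
  exit 1
}
```
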
### Log and Report Errors

Report logs generated on the EC2 instance back to GitHub Actions for users.

For failures in the workflow steps, generate inspect logs, pod logs, and describe logs. Copy the generated logs back to the GitHub Actions runner and use [actions/upload-artifact@v2](https://github.com/actions/upload-artifact) to allow users to access the logs when necessary.

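This proposal does not prescribe the exact log set; as a minimal sketch, the collection step on the EC2 instance could look like the following (file and directory names are assumptions):

```shell
# Sketch only: gather cluster state and KinD node logs into a directory that the
# workflow copies back and uploads with actions/upload-artifact.
mkdir -p ./e2e-logs
kubectl get pods -A -o wide > ./e2e-logs/pods.txt
kubectl describe pods -A > ./e2e-logs/describe.txt
kind export logs ./e2e-logs/kind
```
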
**Note**: By default, artifacts are retained for 90 days. The number of retention days is configurable.

### Clean Up

Regardless of the success or failure of the workflow, the EC2 instance is deleted at the end of the workflow to ensure no resources are left behind.

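Because juju tracks everything it created, the teardown can be a single command. A minimal sketch, assuming the controller name `kf-e2e` from the earlier sketch:

```shell
# Sketch only: destroy the controller and all EC2 resources it created, regardless of
# test outcome. The controller name "kf-e2e" is an assumption; depending on the juju
# version, a non-interactive confirmation flag (e.g. --no-prompt) may also be required.
juju destroy-controller kf-e2e --destroy-all-models
```
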
## Debugging

To debug any failed step of the GitHub Actions workflow, [debugging with ssh](https://github.com/marketplace/actions/debugging-with-ssh) or other similar tools can be used to SSH into the GitHub Actions runner. From the runner, juju can be used to connect to the AWS EC2 instance.

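As a sketch, once on the runner, the instance can be reached through juju (the machine number is an assumption):

```shell
# Sketch only: inspect the machines juju manages and open a shell on the EC2 instance.
juju status
juju ssh 0
```
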
**Notes**:

- GitHub secrets are limited to the Manifests repo and do not cascade to forked repositories. To debug, users must set up their own AWS secrets.
- To debug the AWS EC2 instance without SSHing into the GitHub Actions runner, you must have access to the AWS credentials. Access to the AWS credentials is limited to [Manifests WG approvers](https://github.com/kubeflow/manifests/blob/master/OWNERS).

## Proof of Concept Workflow

The POC code is [available](https://github.com/DomFleischmann/manifests/blob/aj-dev/.github/workflows/aws_e2e_tests.yaml) with examples of both [successful](https://github.com/DomFleischmann/manifests/actions/runs/4118561167/jobs/7111228604) and [failed](https://github.com/DomFleischmann/manifests/actions/runs/4119052861) runs.

The proposed end-to-end workflow has been tested with the following Kubernetes and Kubeflow versions:

- Kubernetes 1.22 and the [1.6.1 Kubeflow release](https://github.com/kubeflow/manifests/releases/tag/v1.6.1) (microk8s)
- Kubernetes 1.24 and the main branch of the manifests repo ([last commit](https://github.com/DomFleischmann/manifests/commit/8e5714171f1fd5b00f59f436e9ab8cb45a0f30e3)) (microk8s)
- Kubernetes 1.25 and the main branch of the manifests repo ([last commit](https://github.com/DomFleischmann/manifests/commit/8e5714171f1fd5b00f59f436e9ab8cb45a0f30e3)) (kind)

### Alternative solutions considered

#### Prow

While some tests already exist with Prow, they were discarded because they had not been updated in two years and were highly complex. After some investigation, the Manifests Working Group decided that adapting those tests to the current state of the manifests would be more work than starting from scratch with lower complexity.

#### Self-hosted runners

Self-hosted runners are not recommended for public repositories due to security concerns with how they behave on pull requests made from forked repositories.

#### MicroK8s

Instead of KinD, [microk8s](https://microk8s.io/) was considered as an alternative for installing Kubernetes.

Below are the steps required in the workflow to install microk8s and to deploy Kubernetes with it. During the Kubernetes installation, you must enable the [dns](https://microk8s.io/docs/addon-dns), [storage](https://microk8s.io/docs/addon-hostpath-storage), [ingress](https://microk8s.io/docs/addon-ingress), [loadbalancer](https://microk8s.io/docs/addon-metallb), and [rbac](https://microk8s.io/docs/multi-user) add-ons.

```shell
# Install microk8s (the channel comes from the workflow matrix, e.g. 1.24/stable)
sudo snap install microk8s --classic --channel ${{ matrix.microk8s }}
sudo apt update
sudo usermod -a -G microk8s ubuntu

# Install dependencies - kubectl
sudo snap alias microk8s.kubectl kubectl

# Deploy kubernetes using microk8s
microk8s enable dns hostpath-storage ingress metallb:10.64.140.43-10.64.140.49 rbac
```

**Note**: microk8s requires an IP address pool when enabling metallb; the address pool of 10.64.140.43-10.64.140.49 is an arbitrary choice.