# End-to-end Testing
**Authors**: Dominik Fleischmann ([@domFleischmann](https://github.com/domFleischmann)), Kimonas
Sotirchos ([@kimwnasptd](https://github.com/kimwnasptd)), and Anna Jung ([@annajung](https://github.com/annajung))
## Background
Previously, the Kubeflow community leveraged prow optional-test-infra for e2e testing with credit from AWS. After the
optional test infrastructure deprecation notice, all WGs moved their tests to GitHub Actions as a temporary solution. Due
to resource constraints of GitHub-hosted runners, the Kubeflow community stopped supporting e2e tests as part of the
migration. In partnership with Amazon, a new AWS account has been created with sponsored credits. With the new AWS
account, the Kubeflow community is no longer limited by resource constraints posed by GitHub Actions. To enable the e2e
test for the Manifest repo, this doc proposes a design to set up the infrastructure needed to run the necessary tests.
References:
- [Optional Test Infra Deprecation Notice](https://github.com/kubeflow/testing/issues/993)
- [Alternative solution to removal of test on optional-test-infra](https://github.com/kubeflow/testing/issues/1006)
## Goal
Enable e2e testing for the Manifest repo and leverage it to shorten the manifest testing phase of the Kubeflow
release cycle and to increase the quality of the Kubeflow release by ensuring Kubeflow components and dependencies work
correctly together.
## Proposal
After some initial conversations, it has been agreed to create integration tests based on GitHub Actions, which will
spawn an EC2 instance with enough resources to deploy the complete Kubeflow solution and run some end-to-end testing.
## Implementation
Below are the steps the GitHub Actions workflow will perform to complete end-to-end testing:
- [Create credentials required by AWS](#create-credentials-required-by-aws)
- [Create an EC2 instance](#create-an-ec2-instance)
- [Install Kubernetes on the instance](#install-kubernetes-on-the-instance)
- [Deploy Kubeflow](#deploy-kubeflow)
- [Run tests](#run-tests)
- [Log and report errors](#log-and-report-errors)
- [Clean up](#clean-up)
### Create credentials required by AWS
To leverage AWS, two credentials are required:
- `AWS_ACCESS_KEY_ID`: Specifies an AWS access key associated with an IAM user or role.
- `AWS_SECRET_ACCESS_KEY`: Specifies the secret key associated with the access key. This is essentially the "password"
for the access key.
Both credentials need to
be [stored as secrets on GitHub](https://docs.github.com/en/actions/security-guides/encrypted-secrets)
and will be accessed in a workflow as environment variables:
```yaml
env:
  AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```
### Create an EC2 instance
Access the AWS credentials (stored as GitHub Secrets) and create an EC2 instance.
Using [juju](https://juju.is/) as an orchestration tool, configure the AWS credentials and deploy an EC2 instance with
the following configuration (a minimal sketch follows the list):
- Image: Ubuntu Server (latest)
- Type: t3a.xlarge
- Root disk: 80G
- Region: us-east-1 (default)
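A minimal sketch of what this could look like with juju (the controller name and machine ID are illustrative, and the
exact flags may differ in the final workflow):
```shell
# Register the AWS credential for juju (reads the AWS_* values; in CI this would be done non-interactively)
juju add-credential aws
# Bootstrap a controller in the default region
juju bootstrap aws/us-east-1 kf-e2e-controller
# Provision the test machine with the proposed instance size and root disk (Ubuntu is the juju default image)
juju add-machine --constraints "instance-type=t3a.xlarge root-disk=80G"
# Interact with the machine once it is provisioned (machine 0 is illustrative)
juju ssh 0 -- "cloud-init status --wait"
```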
#### Why juju?
Juju allows easy configuration of various cloud providers. If a reason to shift to another infrastructure provider
arises in the future, it would allow us to pivot quickly.
While juju provides far more capability, the proposal is to use the tool only for configuration management and as a
medium to deploy and connect to EC2 instances.
**Note**: Using GitHub Secrets to store AWS credentials will not allow any forked repositories to access the secrets.
### Install Kubernetes on the Instance
Install Kubernetes on the EC2 instance where Kubeflow will be deployed and tested.
To install Kubernetes, we explored two options and propose to use **KinD**:
- [Microk8s](#microk8s)
- [KinD](#kind)
#### KinD
Using KinD, install Kubernetes with the existing KinD configuration managed by the Manifest WG.
```shell
# Install dependencies - docker
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common tar
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo systemctl status docker
sudo usermod -a -G docker ubuntu
# Install dependencies - kubectl
sudo curl -L "https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl" -o /usr/local/bin/kubectl
sudo chmod +x /usr/local/bin/kubectl
kubectl version --short --client
# Install KinD
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.17.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Deploy kubernetes using KinD
cd manifests
kind create cluster --config ./tests/gh-actions/kind-cluster.yaml
```
##### Why KinD?
While many tools can be leveraged to deploy Kubernetes, Manifest WG already leverages KinD to run both core and contrib
component tests. By reusing the tool, we can leverage the existing KinD configuration and keep the similarity between
component and e2e testing.
**Note**: KinD is a subproject of Kubernetes but does not automatically release with a new Kubernetes version and does
not follow the Kubernetes release cadence. More details can be found at
[kind/issue#197](https://github.com/kubernetes-sigs/kind/issues/197).
### Deploy Kubeflow
Deploy Kubeflow in the same manner the Manifests WG documents.
Copy the manifests repo to the AWS instance and use Kustomize to run the Kubeflow installation. After the Kustomize
installation is complete, verify all pods are running.
The documented manifests installation retries in a while loop and can therefore loop indefinitely; a time limit of 45
minutes should be set to ensure the installation exits when a problem occurs with the Kubeflow installation.
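A minimal sketch of the bounded installation, assuming the single-command, retry-loop install documented in the
manifests repo (the namespaces checked below are illustrative):
```shell
cd manifests
# Bound the documented retry-loop install to 45 minutes so a broken installation fails the job
timeout 45m bash -c \
  'while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done'
# Verify all pods are running before executing tests
kubectl wait --for=condition=Ready pods --all --timeout=600s -n kubeflow
kubectl wait --for=condition=Ready pods --all --timeout=600s -n kubeflow-user-example-com
```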
### Run Tests
Execute integration tests to verify the correct functioning of different features using python scripts and jupyter
notebooks.
As the first iteration, test the Kubeflow integration using the
existing [e2e mnist python script](https://github.com/kubeflow/manifests/tree/master/tests/e2e)
and [e2e mnist notebook](https://github.com/kubeflow/pipelines/blob/master/samples/experimental/kubeflow-e2e-mnist/kubeflow-e2e-mnist.ipynb).
- [Python script](#python-script)
- [Jupyter notebook](#jupyter-notebook)
Both the python script and the notebook test the following:
- KFP and Katib SDK packages (compatibility with other python packages)
- Creation and execution of a pipeline from a user namespace
- Creation and execution of hyperparameter tuning with Katib from a user namespace
- Creation and execution of distributed training with TFJob from a user namespace
- Creation and execution of inference using KServe from a user namespace
**Note**: The mnist notebook does not test the Kubeflow Notebook resources. In the future, additional verification and
tests should be added to cover various Kubeflow components and features.
#### Python script
Steps to run the e2e python script from the workflow (a sketch follows the list):
1. Convert e2e mnist notebook to a python script (
reuse [mnist.py](https://github.com/kubeflow/manifests/blob/master/tests/e2e/mnist.py))
2. Run mnist python script outside of the cluster (
reuse [runner.sh](https://github.com/kubeflow/manifests/blob/master/tests/e2e/runner.sh))
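A minimal sketch of these two steps (the exact arguments of `runner.sh` may differ; see the script in the repo):
```shell
# 1. Regenerate the python script from the notebook (the repo already ships mnist.py)
jupyter nbconvert --to script kubeflow-e2e-mnist.ipynb --output mnist
# 2. Run the mnist script from outside the cluster via the existing runner
cd manifests/tests/e2e && ./runner.sh
```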
#### Jupyter notebook
Steps to run the e2e notebook from the workflow:
1. Get the e2e mnist notebook
   1. To run the existing e2e mnist notebook, a modification needs to be made in the last step to wait for the
      triggered run to finish before executing the inference. The proposed changes are shown below, and a pull request
      will need to be made in the future to avoid copying the mnist notebook into the manifests directory.
```python
import numpy as np
import time
from PIL import Image
import requests

# Pipeline Run should be succeeded.
run_status = kfp_client.get_run(run_id=run_id).run.status
if run_status is None:
    print("Waiting for the Run {} to start".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

while run_status == "Running":
    print("Run {} is in progress".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

if run_status == "Succeeded":
    print("Run {} has Succeeded\n".format(run_id))

    # Specify the image URL here.
    image_url = "https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/kubeflow-pipelines/images/9.bmp"
    image = Image.open(requests.get(image_url, stream=True).raw)
    data = np.array(image.convert('L').resize((28, 28))).astype(float).reshape(-1, 28, 28, 1)
    data_formatted = np.array2string(data, separator=",", formatter={"float": lambda x: "%.1f" % x})
    json_request = '{{ "instances" : {} }}'.format(data_formatted)

    # Specify the prediction URL. If you are running this notebook outside of the Kubernetes cluster, you should set the Cluster IP.
    url = "http://{}-predictor-default.{}.svc.cluster.local/v1/models/{}:predict".format(name, namespace, name)
    time.sleep(60)
    response = requests.post(url, data=json_request)
    print("Prediction for the image")
    display(image)
    print(response.json())
else:
    raise Exception("Run {} failed with status {}\n".format(run_id, kfp_client.get_run(run_id=run_id).run.status))
```
2. Move the mnist notebook into the cluster
```shell
kubectl -n kubeflow-user-example-com create configmap <configmap name> --from-file kubeflow-e2e-mnist.ipynb
```
3. Create a PodDefault to allow access to Kubeflow pipelines
```yaml
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: access-ml-pipeline
  namespace: kubeflow-user-example-com
spec:
  desc: Allow access to Kubeflow Pipelines
  selector:
    matchLabels:
      access-ml-pipeline: "true"
  env:
    - ## this environment variable is automatically read by `kfp.Client()`
      ## this is the default value, but we show it here for clarity
      name: KF_PIPELINES_SA_TOKEN_PATH
      value: /var/run/secrets/kubeflow/pipelines/token
  volumes:
    - name: volume-kf-pipeline-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 7200
              ## defined by the `TOKEN_REVIEW_AUDIENCE` environment variable on the `ml-pipeline` deployment
              audience: pipelines.kubeflow.org
  volumeMounts:
    - mountPath: /var/run/secrets/kubeflow/pipelines
      name: volume-kf-pipeline-token
      readOnly: true
```
4. Run the notebook programmatically using a Kubernetes Job or a Kubeflow Notebook resource
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test-notebook-job
  namespace: kubeflow-user-example-com
spec:
  backoffLimit: 1
  activeDeadlineSeconds: 1200
  template:
    metadata:
      labels:
        access-ml-pipeline: "true"
    spec:
      restartPolicy: Never
      initContainers:
        - name: copy-notebook
          image: busybox
          command: ['sh', '-c', 'cp /scripts/* /etc/kubeflow-e2e/']
          volumeMounts:
            - name: e2e-test
              mountPath: /scripts
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      containers:
        - image: kubeflownotebookswg/jupyter-scipy:v1.6.1
          imagePullPolicy: IfNotPresent
          name: execute-notebook
          command:
            - /bin/sh
            - -c
            - |
              jupyter nbconvert --to notebook --execute /etc/kubeflow-e2e/kubeflow-e2e-mnist.ipynb;
              x=$(echo $?); curl -fsI -X POST http://localhost:15020/quitquitquit && exit $x;
          volumeMounts:
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      serviceAccountName: default-editor
      volumes:
        - name: e2e-test
          configMap:
            name: e2e-test
        - name: kubeflow-e2e
          emptyDir: {}
```
5. Verify Job succeeded or failed
```shell
kubectl -n kubeflow-user-example-com wait --for=condition=complete --timeout=1200s job/test-notebook-job
```
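If the wait fails or times out, the Job's pod output can be surfaced for the log collection described in the next
section (the Job name matches the manifest above):
```shell
# Dump the executed notebook's output and the Job status for debugging
kubectl -n kubeflow-user-example-com logs job/test-notebook-job
kubectl -n kubeflow-user-example-com describe job/test-notebook-job
```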
### Log and Report Errors
Report logs generated on the EC2 instance back to GitHub Actions for users.
For failures in the workflow steps, generate inspect, pod, and describe logs. Copy the generated logs back to
the GitHub Actions runner and use [actions/upload-artifact@v2](https://github.com/actions/upload-artifact)
to allow users to access the logs when necessary.
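A minimal sketch of the collection step (the machine ID, log paths, and exact commands are illustrative; the upload
itself is handled by `actions/upload-artifact`):
```shell
# Gather diagnostics on the EC2 instance, then copy them back to the runner for upload as artifacts
juju ssh 0 -- 'mkdir -p ~/logs && kubectl get pods -A > ~/logs/pods.txt && kubectl describe pods -A > ~/logs/describe.txt'
juju scp -- -r 0:logs ./logs
```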
**Note**: By default, artifacts are retained for 90 days. The number of retention days is configurable.
### Clean Up
Regardless of the workflow's success or failure, the EC2 instance is deleted at the end of the workflow to ensure
there are no resources left behind.
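A minimal sketch of the teardown, run in a step with `if: always()` so it executes even after failures (the controller
name is illustrative):
```shell
# Destroy the juju controller and every model/machine it manages, including the EC2 instance
juju destroy-controller --destroy-all-models -y kf-e2e-controller
```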
## Debugging
To debug any failed step of the GitHub Actions
workflow, [debugging with ssh](https://github.com/marketplace/actions/debugging-with-ssh)
or other similar tools can be used to ssh into the GitHub-hosted runner. From the runner, juju can be used to connect
to the AWS EC2 instance.
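For example, once on the runner (machine ID illustrative):
```shell
# Inspect the juju-managed machine and open a shell on it
juju status
juju ssh 0
```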
**Notes**:
- GitHub secrets are limited to the Manifest repo and do not cascade to forked repositories. To debug, users must set up
their own AWS secrets.
- To debug the AWS EC2 instance without ssh-ing into the GitHub-hosted runner, you must have access to the AWS
  credentials. Access to the AWS credentials is limited to [Manifest WG approvers](https://github.com/kubeflow/manifests/blob/master/OWNERS).
## Proof of Concept Workflow
The POC code
is [available](https://github.com/DomFleischmann/manifests/blob/aj-dev/.github/workflows/aws_e2e_tests.yaml)
with examples of both [successful](https://github.com/DomFleischmann/manifests/actions/runs/4118561167/jobs/7111228604)
and [failed](https://github.com/DomFleischmann/manifests/actions/runs/4119052861) runs.
The proposed end-to-end workflow has been tested with the following Kubernetes and Kubeflow versions:
- 1.22 Kubernetes and [1.6.1 Kubeflow release](https://github.com/kubeflow/manifests/releases/tag/v1.6.1) (microk8s)
- 1.24 Kubernetes and main branch of the manifests repo ([last commit](https://github.com/DomFleischmann/manifests/commit/8e5714171f1fd5b00f59f436e9ab8cb45a0f30e3)) (microk8s)
- 1.25 Kubernetes and main branch of the manifests repo ([last commit](https://github.com/DomFleischmann/manifests/commit/8e5714171f1fd5b00f59f436e9ab8cb45a0f30e3)) (kind)
### Alternative solutions considered
#### Prow
While some Prow-based tests already exist, they were discarded because they had not been updated in two years and
carried a high amount of complexity. After some investigation, the Manifests Working Group decided that adapting those
tests to the current state of the manifests would be more work than starting from scratch with lower complexity.
#### Self-hosted runners
Self-hosted runners are not recommended for public repositories due to security concerns with how they behave on pull
requests made from forked repositories.
#### MicroK8s
Instead of KinD, [microk8s](https://microk8s.io/) was considered as an alternative to install Kubernetes.
Below are the steps required in the workflow to install microk8s and to deploy Kubernetes using microk8s. During
the Kubernetes installation, you must enable [dns](https://microk8s.io/docs/addon-dns),
[storage](https://microk8s.io/docs/addon-hostpath-storage), [ingress](https://microk8s.io/docs/addon-ingress),
[loadbalancer](https://microk8s.io/docs/addon-metallb), and [rbac](https://microk8s.io/docs/multi-user).
```shell
# Install microk8s
sudo snap install microk8s --classic --channel ${{ matrix.microk8s }}
sudo apt update
sudo usermod -a -G microk8s ubuntu
# Install dependencies - kubectl
sudo snap alias microk8s.kubectl kubectl
# Deploy kubernetes using microk8s (already installed above) by enabling the required addons
microk8s enable dns hostpath-storage ingress metallb:10.64.140.43-10.64.140.49 rbac
```
**Note**: microk8s requires an IP address pool when enabling metallb; the address pool 10.64.140.43-10.64.140.49 is an
arbitrary choice.