End-to-end Testing
Authors: Dominik Fleischmann (@domFleischmann), Kimonas Sotirchos (@kimwnasptd), and Anna Jung (@annajung)
Background
Previously, the Kubeflow community leveraged the Prow-based optional-test-infra for e2e testing with credits from AWS. After the optional test infrastructure deprecation notice, all WGs moved their tests to GitHub Actions as a temporary solution. Due to the resource constraints of GitHub-hosted runners, the Kubeflow community stopped supporting e2e tests as part of the migration. In partnership with Amazon, a new AWS account has been created with sponsored credits. With the new AWS account, the Kubeflow community is no longer limited by the resource constraints posed by GitHub Actions. To enable e2e tests for the Manifests repo, this doc proposes a design for the infrastructure needed to run the necessary tests.
References
- Optional Test Infra Deprecation Notice
- Alternative solution to removal of test on optional-test-infra
Goal
Enable e2e testing for the Manifests repo and leverage it to shorten the manifest testing phase of the Kubeflow release cycle and to increase the quality of the Kubeflow release by ensuring that Kubeflow components and dependencies work correctly together.
Proposal
After some initial conversations, it was agreed to create integration tests based on GitHub Actions, which will spawn an EC2 instance with enough resources to deploy the complete Kubeflow solution and run end-to-end tests.
Implementation
The GitHub Actions workflow will perform the following steps to complete end-to-end testing:
- Create credentials required by AWS
- Create an EC2 instance
- Install Kubernetes on the instance
- Deploy Kubeflow
- Run tests
- Log and report errors
- Clean up
Create credentials required by AWS
To leverage AWS, two credentials are required:
- AWS_ACCESS_KEY_ID: Specifies an AWS access key associated with an IAM user or role.
- AWS_SECRET_ACCESS_KEY: Specifies the secret key associated with the access key. This is essentially the "password" for the access key.
Both credentials need to be stored as secrets on GitHub and will be accessed in the workflow as environment variables.
env:
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
Create an EC2 instance
Access the AWS credentials (stored as GitHub Secrets) and create an EC2 instance.
Using juju for orchestration, configure the AWS credentials and deploy an EC2 instance with the following configuration (a sketch of the corresponding juju commands follows the list):
- Image: Ubuntu Server (latest)
- Type: t3a.xlarge
- Root disk: 80G
- Region: us-east-1 (default)
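A minimal sketch of what these juju commands might look like, assuming the credentials have been written to a local aws-credentials.yaml file and that the controller and model names are placeholders:
# Sketch: register the AWS credentials with juju and request an EC2 instance
# (credential file, controller name, and model name are assumptions)
juju add-credential aws -f aws-credentials.yaml
juju bootstrap aws/us-east-1 e2e-controller
juju add-model e2e-tests
juju add-machine --constraints "instance-type=t3a.xlarge root-disk=80G"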
Why juju?
juju allows easy configuration for various cloud providers. In the future, if there is a reason to shift to another infrastructure provider, it would allow us to pivot quickly.
While juju provides more capability, the proposal is to use the tool only for configuration management and as a medium to deploy and connect to EC2 instances.
Note: Using GitHub Secrets to store AWS credentials will not allow any forked repositories to access the secrets.
Install Kubernetes on the Instance
Install Kubernetes on the EC2 instance where Kubeflow will be deployed and tested
To install Kubernetes, we explored two options and propose to use KinD
KinD
Using KinD, install Kubernetes with the existing KinD configuration managed by the Manifest WG.
# Install dependencies - docker
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common tar
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
apt-cache policy docker-ce
sudo apt install -y docker-ce
sudo systemctl status docker
sudo usermod -a -G docker ubuntu
# Install dependencies - kubectl
sudo curl -L "https://storage.googleapis.com/kubernetes-release/release/`curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt`/bin/linux/amd64/kubectl" -o /usr/local/bin/kubectl
sudo chmod +x /usr/local/bin/kubectl
kubectl version --short --client
# Install KinD
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.17.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Deploy kubernetes using KinD
cd manifests
kind create cluster --config ./tests/gh-actions/kind-cluster.yaml
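After the cluster is created, a quick sanity check (a sketch, assuming kubectl is already configured for the new cluster) confirms the nodes are ready before deploying Kubeflow:
# Sketch: verify the KinD cluster is up before deploying Kubeflow
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s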
Why KinD?
While many tools can be leveraged to deploy Kubernetes, Manifest WG already leverages KinD to run both core and contrib component tests. By reusing the tool, we can leverage the existing KinD configuration and keep the similarity between component and e2e testing.
Note: KinD is a subproject of Kubernetes but does not automatically release with a new Kubernetes version and does not follow the Kubernetes release cadence. More details can be found at kind/issue#197.
Deploy Kubeflow
Deploy Kubeflow in the same manner as documented by the Manifests WG.
Copy the manifests repo to the AWS instance and use Kustomize to run the Kubeflow installation. After the Kustomize installation is complete, verify that all pods are running.
The manifest installation loop may run indefinitely; therefore, a time limit of 45 minutes should be set to ensure the installation exits when a problem occurs with the Kubeflow installation.
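A minimal sketch of how the installation step could be bounded, assuming the kustomize-based install loop documented in the manifests README and GNU timeout on the instance:
# Sketch: bound the documented install loop with a 45-minute limit
# (assumes the kustomize-based install command from the manifests README)
cd manifests
timeout 45m bash -c 'while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done'
# Verify that all pods are running before starting the tests
kubectl wait --for=condition=Ready pods --all --all-namespaces --timeout=600s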
Run Tests
Execute integration tests to verify the correct functioning of different features using Python scripts and Jupyter notebooks.
As the first iteration, test the Kubeflow integration using the existing e2e mnist Python script and e2e mnist notebook.
Both the Python script and the notebook test the following:
- Kfp and Katib SDK packages (compatibility with other python packages)
- Creation and execution of a pipeline from a user namespace
- Creation and execution of hyperparameter tuning with Katib from a user namespace
- Creation and execution of distributed training with TFJob from a user namespace
- Creation and execution of inference using KServe from a user namespace
Note: The mnist notebook does not test the Kubeflow Notebook resources. In the future, additional verification and tests should be added to cover various Kubeflow components and features.
Python script
Steps to run the e2e Python script from the workflow (a sketch of these steps follows the list):
- Convert the e2e mnist notebook to a Python script (reuse mnist.py)
- Run the mnist Python script outside of the cluster (reuse runner.sh)
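A minimal sketch of these two steps, assuming jupyter and the notebook's Python dependencies are available on the instance and that the file names match the notebook referenced above:
# Sketch: convert the notebook to a script and run it outside of the cluster
# (assumes jupyter and the required Python packages are installed)
jupyter nbconvert --to script kubeflow-e2e-mnist.ipynb   # produces kubeflow-e2e-mnist.py
python kubeflow-e2e-mnist.py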
Jupyter notebook
Steps to run the e2e notebook from the workflow:
- Get the e2e mnist notebook
  - To run the existing e2e mnist notebook, a modification needs to be made in the last step so that it waits for the triggered run to finish before executing. The proposed changes are shown below; a pull request will need to be made in the future to avoid copying the mnist notebook into the manifests directory.
import numpy as np
import time
from PIL import Image
import requests

# Pipeline Run should be succeeded.
run_status = kfp_client.get_run(run_id=run_id).run.status
if run_status == None:
    print("Waiting for the Run {} to start".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

while run_status == "Running":
    print("Run {} is in progress".format(run_id))
    time.sleep(60)
    run_status = kfp_client.get_run(run_id=run_id).run.status

if run_status == "Succeeded":
    print("Run {} has Succeeded\n".format(run_id))

    # Specify the image URL here.
    image_url = "https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/kubeflow-pipelines/images/9.bmp"
    image = Image.open(requests.get(image_url, stream=True).raw)
    data = np.array(image.convert('L').resize((28, 28))).astype(float).reshape(-1, 28, 28, 1)
    data_formatted = np.array2string(data, separator=",", formatter={"float": lambda x: "%.1f" % x})
    json_request = '{{ "instances" : {} }}'.format(data_formatted)

    # Specify the prediction URL. If you are running this notebook outside of the Kubernetes cluster, you should set the Cluster IP.
    url = "http://{}-predictor-default.{}.svc.cluster.local/v1/models/{}:predict".format(name, namespace, name)
    time.sleep(60)
    response = requests.post(url, data=json_request)

    print("Prediction for the image")
    display(image)
    print(response.json())
else:
    raise Exception("Run {} failed with status {}\n".format(run_id, kfp_client.get_run(run_id=run_id).run.status))
- Move the mnist notebook into the cluster
kubectl -n kubeflow-user-example-com create configmap <configmap name> --from-file kubeflow-e2e-mnist.ipynb
- Create a PodDefault to allow access to Kubeflow Pipelines
apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: access-ml-pipeline
  namespace: kubeflow-user-example-com
spec:
  desc: Allow access to Kubeflow Pipelines
  selector:
    matchLabels:
      access-ml-pipeline: "true"
  env:
    ## this environment variable is automatically read by `kfp.Client()`
    ## this is the default value, but we show it here for clarity
    - name: KF_PIPELINES_SA_TOKEN_PATH
      value: /var/run/secrets/kubeflow/pipelines/token
  volumes:
    - name: volume-kf-pipeline-token
      projected:
        sources:
          - serviceAccountToken:
              path: token
              expirationSeconds: 7200
              ## defined by the `TOKEN_REVIEW_AUDIENCE` environment variable on the `ml-pipeline` deployment
              audience: pipelines.kubeflow.org
  volumeMounts:
    - mountPath: /var/run/secrets/kubeflow/pipelines
      name: volume-kf-pipeline-token
      readOnly: true
- Run the notebook programmatically using a Kubernetes resource (Job or Notebook)
apiVersion: batch/v1
kind: Job
metadata:
  name: test-notebook-job
  namespace: kubeflow-user-example-com
spec:
  backoffLimit: 1
  activeDeadlineSeconds: 1200
  template:
    metadata:
      labels:
        access-ml-pipeline: "true"
    spec:
      restartPolicy: Never
      initContainers:
        - name: copy-notebook
          image: busybox
          command: ['sh', '-c', 'cp /scripts/* /etc/kubeflow-e2e/']
          volumeMounts:
            - name: e2e-test
              mountPath: /scripts
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      containers:
        - image: kubeflownotebookswg/jupyter-scipy:v1.6.1
          imagePullPolicy: IfNotPresent
          name: execute-notebook
          command:
            - /bin/sh
            - -c
            - |
              jupyter nbconvert --to notebook --execute /etc/kubeflow-e2e/kubeflow-e2e-mnist.ipynb;
              x=$(echo $?);
              curl -fsI -X POST http://localhost:15020/quitquitquit && exit $x;
          volumeMounts:
            - name: kubeflow-e2e
              mountPath: /etc/kubeflow-e2e
      serviceAccountName: default-editor
      volumes:
        - name: e2e-test
          configMap:
            name: e2e-test
        - name: kubeflow-e2e
          emptyDir: {}
- Verify the Job succeeded or failed
kubectl -n kubeflow-user-example-com wait --for=condition=complete --timeout=1200s job/test-notebook-job
Log and Report Errors
Report logs generated on the EC2 instance back to GitHub Actions for users.
For failures in the workflow steps, generate inspect logs, pod logs, and describe logs. Copy the generated logs back to the GitHub Actions runner and use actions/upload-artifact@v2 to allow users to access the logs when necessary.
Note: By default, artifacts are retained for 90 days. The number of retention days is configurable.
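A minimal sketch of what the upload step could look like in the workflow; the step name, log directory, and retention period are assumptions:
# Sketch: upload the collected logs as a workflow artifact (names and paths are assumptions)
- name: Upload e2e logs
  if: failure()
  uses: actions/upload-artifact@v2
  with:
    name: e2e-test-logs
    path: logs/
    retention-days: 14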
Clean Up
Regardless of the success or failure of the workflow, the EC2 instance is deleted at the end of the workflow to ensure there are no resources left behind, as sketched below.
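A sketch of how the clean-up could be expressed as a final workflow step that always runs; the step name and controller name (which follows the earlier juju sketch) are assumptions:
# Sketch: always tear down the juju controller and its EC2 machines
- name: Clean up AWS resources
  if: always()
  run: juju destroy-controller e2e-controller --destroy-all-models -y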
Debugging
To debug any failed step of the GitHub Actions workflow, an SSH debugging action or other similar tools can be used to SSH into the GitHub Actions runner. From the runner, juju can be used to connect to the AWS EC2 instance (see the example after the notes below).
Notes:
- GitHub secrets are limited to the Manifest repo and do not cascade to forked repositories. To debug, users must set up their own AWS secrets.
- To debug the AWS EC2 instance without SSHing into the GitHub Actions runner, you must have access to the AWS credentials. Access to the AWS credentials is limited to Manifests WG approvers.
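For example, once a debug session on the runner is open, a shell on the EC2 instance might be reached with (the machine id is an assumption):
# Sketch: open a shell on the first machine in the juju model
juju ssh 0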
Proof of Concept Workflow
The POC code is available with examples of both successful and failed runs.
The proposed end-to-end workflow has been tested with the following Kubernetes and Kubeflow versions:
- 1.22 Kubernetes and 1.6.1 Kubeflow release (microk8s)
- 1.24 Kubernetes and main branch of the manifests repo (last commit) (microk8s)
- 1.25 Kubernetes and main branch of the manifests repo (last commit) (KinD)
Alternative solutions considered
Prow
While there are some existing tests with Prow, those tests were discarded because they had not been updated in two years and carried a high amount of complexity. After some investigation, the Manifests Working Group decided that adapting those tests to the current state of the manifests would be more work than starting from scratch with lower complexity.
Self-hosted runners
Self-hosted runners are not recommended for public repositories due to security concerns with how they behave on pull requests made from forked repositories.
MicroK8s
Instead of KinD, microk8s was considered as an alternative to install Kubernetes.
Below are the steps required in the workflow to install microk8s and to deploy Kubernetes using microk8s. During the Kubernetes installation, the dns, storage, ingress, loadbalancer, and rbac add-ons must be enabled.
# Install microk8s
sudo snap install microk8s --classic --channel ${{ matrix.microk8s }}
sudo apt update
sudo usermod -a -G microk8s ubuntu
# Install dependencies - kubectl
sudo snap alias microk8s.kubectl kubectl
# Deploy kubernetes using microk8s
sudo snap install microk8s --classic --channel 1.24/stable
microk8s enable dns hostpath-storage ingress metallb:10.64.140.43-10.64.140.49 rbac
Note: microk8s requires an IP address pool when enabling metallb; the address pool 10.64.140.43-10.64.140.49 is an arbitrary choice.