In this workshop, we’ll be learning how to perform basic Data Science and Machine Learning using OpenShift. The OpenShift platform greatly simplifies deploying and using the tools that are typically used in most Data Science workflows today. In particular, we’ll be using the Open Data Hub project to deploy most of the infrastructure we’ll be using in the lab today.
Open Data Hub (ODH) is an open source project based on Kubeflow that provides open source AI tools for running large and distributed AI workloads on OpenShift Container Platform. Currently, the Open Data Hub project provides open source tools for data storage, distributed AI and Machine Learning (ML) workflows, Jupyter Notebook development environment and monitoring. The Open Data Hub project roadmap offers a view on new tools and integration the project developers are planning to add.
Open Data Hub includes several open source components, which can be individially enabled. They include:
-
Apache Airflow
-
Apache Kafka
-
Apache Spark
-
Apache Superset
-
Argo
-
Grafana
-
JupyterHub
-
Prometheus
-
Seldon
Open Data Hub itself is deployed on OpenShift via an Operator. Operators on OpenShift are responsible for managing Custom Resources. Custom resources allow for OpenShift users to create higher level objects in Kubernetes than are available via the basic Kubernetes API. For example, the AMQ Streams Operator defines a custom resource called "Kafka", which allows us to create a Kafka cluster. Operators can define more than one Custom Resource - the AMQ Streams Operator also allows for the creation of a "KafkaTopic" on our "Kafka" cluster.
The first step in developing a model for machine learning is often cleansing the data on which we’re going to operate. In this section of the workshop, we’ll learn how to perform basic data cleansing using a set of data available on the Internet. We’ll download the data in CSV format and use the Python Pandas library to clean and visualize the data inside of a Jupyter notebook.
Jupyter notebooks are widely used throughout the data science community to organize, run, and share code. Jupyter supports many programming languages, but the most common languages used in data science are Python and R. This workshops examples will all be in Python, as will the code we use to train and run our model.
In this first exercise, we’ll be deploying a Jupyter Hub instance for our user on OpenShift and loading a notebook that performs some basic data cleansing and visualization exercizes on COVID-19 data from the Georgia Department of Health.
We’ll be using Open Data Hub to deploy JupyterHub and Ceph Nano. The Open Data Hub operator is already installed for you, but you’ll need to deploy an instance of it into your project.
To deploy ODH, first open a new brower with the {{ CONSOLE_URL }}[OpenShift web console^]:
Login using:
-
Username:
{{ USER_ID }} -
Password:
{{ OPENSHIFT_USER_PASSWORD }}
When you first log in, you’ll be presented with a Welcome dialog.
Go ahead and select "Get started" to get a tour of the user interface.
After you finish the tour, you will see a list of projects to which you have access:
Select "{{ USER_ID }}-notebooks" to enter the namespace for the workshop.
To deploy an instance of ODH, click on *All Services" under "Developer Catalog" on the Add page.
Type in datahub in the search box, and click on the Open Data Hub:
Click on Create. This will bring you to a screen which allows you to customize the components that you’d like to install.
Click on YAML View. Replace the YAML in the edit box with the following definition:
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
name: opendatahub
spec:
applications:
- kustomizeConfig:
repoRef:
name: manifests
path: odh-common
name: odh-common
- kustomizeConfig:
repoRef:
name: manifests
path: grafana/cluster
name: grafana-cluster
- kustomizeConfig:
repoRef:
name: manifests
path: grafana/grafana
name: grafana-instance
- kustomizeConfig:
repoRef:
name: manifests
path: prometheus/cluster
name: prometheus-cluster
- kustomizeConfig:
repoRef:
name: manifests
path: prometheus/operator
name: prometheus-operator
- kustomizeConfig:
parameters:
- name: s3_endpoint_url
value: 'http://ceph-nano-0'
repoRef:
name: manifests
path: jupyterhub/jupyterhub
name: jupyterhub
- kustomizeConfig:
overlays:
- additional
repoRef:
name: manifests
path: jupyterhub/notebook-images
name: notebook-images
- kustomizeConfig:
repoRef:
name: manifests
path: ceph/object-storage/scc
name: ceph-nano-scc
# Deploy ceph-nano for minimal object storage running in a pod
- kustomizeConfig:
repoRef:
name: manifests
path: ceph/object-storage/nano
name: ceph-nano
- kustomizeConfig:
overlays:
- authentication
repoRef:
name: manifests
path: odh-dashboard
name: odh-dashboard
repos:
- name: kf-manifests
uri: >-
https://github.com/opendatahub-io/manifests/tarball/v1.5-branch-openshift
- name: manifests
uri: 'https://github.com/opendatahub-io/odh-manifests/tarball/v1.3'Click Create. On the topology view, you should see the Open Data Hub Components being deployed.
Note: the little green square in the lower left of the diagram calls out the "arrange" function in the topology view. It might help arrange the topology more neatly.
In Open Data Hub, JupyterHub is deployed with a proxy. To connect to it, we’ll find the route for the proxy and open it in a new tab.
Click on the "traefik-proxy" icon and look for the "Routes" section.
Just click on this link, a new tab will open. Click on the button Sign in with OpenShift, and use your OpenShift credentials to connect.
-
Username:
{{ USER_ID }} -
Password:
{{ OPENSHIFT_USER_PASSWORD }}
On the first connection, OpenShift will ask to authorize the application to access your username just click Allow selected permissions.
Once you log into JupyterHub, you’ll be asked to enter your S3 credentials. When we deployed Open Data Hub, we asked it to provision an instance of "ceph-nano" for us to use. The credentials for "ceph-nano" are stored in our namespace in a Kubernetes Secret.
To find these values, return to the {{ CONSOLE_URL }}/topology/ns/{{ USER_ID }}-notebooks[Topology View^]. Click on Secrets on the left navigation. Then look for ceph-nano-credentials on the list of secrets and click on it. Scroll to the "Data" section and select Reveal Values.
|
Warning
|
Do not click the "START" button until after selecting the Notebook Image and Container Size, the next step after the AWS credentials. It is tempting and easy to do. At least one of your workshop leaders has done it. |
We’ll be adding these values as Environment Variables before we spawn our Notebook Image. To do that, click on "Add more variables" under the "Environment variables" section.
Select Custom variable and enter the AWS_ACCESS_KEY_ID for the name and value from the ceph-nano-credentials secret.
Add another Custom variable for the AWS_SECRET_ACCESS_KEY.
On the Notebook image section, select the Tensorflow Notebook Image image from the list.
Next, select Container size: Medium and click Start at the bottom. We’ll be using Tensorflow to generate the model in our next step.
Your Jupyter environment will take 15-20 seconds to launch.
Once your environment is up and running, select "Terminal" from the "New" menu to launch a terminal.
We’ll start out by cloning the git repository that we’ll be using during this lab. The repository contains the notebooks and the code for creating and deploying our model. In the terminal, run the following command:
git clone https://github.com/Red-Hat-SE-RTO/machine-learning-workshop-labs/Navigate back to the jupyterhub Home Page tab. You should see the labs in the interface now:
Click on the "machine-learning-workshop-labs" folder and click on the "notebooks" folder.
Click on "georgia-covid19.ipynb" to open up the data cleansing notebook.
Run through the notebook using the "Run" button on the main toolbar:
The notebook will guide you through downloading the data from the COVID Tracking Project, visualizing it, and cleansing it.
Once you’ve finished cleansing the data, we’re ready to move on to creating a model to diagnose illness from image files.
Congratulations! You have run the notebook and are ready for the next chapter.
















