
Data Cleansing in OpenShift

In this workshop, we’ll be learning how to perform basic Data Science and Machine Learning using OpenShift. The OpenShift platform greatly simplifies deploying and using the tools that are typically used in most Data Science workflows today. In particular, we’ll be using the Open Data Hub project to deploy most of the infrastructure we’ll be using in the lab today.

Introduction to Open Data Hub

Open Data Hub (ODH) is an open source project based on Kubeflow that provides open source AI tools for running large and distributed AI workloads on OpenShift Container Platform. Currently, the Open Data Hub project provides open source tools for data storage, distributed AI and Machine Learning (ML) workflows, a Jupyter notebook development environment, and monitoring. The Open Data Hub project roadmap offers a view of the new tools and integrations the project developers are planning to add.

Open Data Hub includes several open source components, which can be individually enabled. They include:

  • Apache Airflow

  • Apache Kafka

  • Apache Spark

  • Apache Superset

  • Argo

  • Grafana

  • JupyterHub

  • Prometheus

  • Seldon

Open Data Hub itself is deployed on OpenShift via an Operator. Operators on OpenShift are responsible for managing Custom Resources. Custom Resources allow OpenShift users to create higher-level objects in Kubernetes than are available via the basic Kubernetes API. For example, the AMQ Streams Operator defines a Custom Resource called "Kafka", which allows us to create a Kafka cluster. Operators can define more than one Custom Resource - the AMQ Streams Operator also allows for the creation of a "KafkaTopic" on our "Kafka" cluster.
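
For example, a minimal "Kafka" Custom Resource might look like the following (a sketch only - the exact schema depends on the AMQ Streams version installed on your cluster):

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # Three Kafka brokers with a plain (non-TLS) internal listener
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    storage:
      type: ephemeral
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral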

Data Cleansing

The first step in developing a model for machine learning is often cleansing the data on which we’re going to operate. In this section of the workshop, we’ll learn how to perform basic data cleansing using a set of data available on the Internet. We’ll download the data in CSV format and use the Python Pandas library to clean and visualize the data inside of a Jupyter notebook.

Jupyter notebooks are widely used throughout the data science community to organize, run, and share code. Jupyter supports many programming languages, but the most common languages used in data science are Python and R. This workshop's examples will all be in Python, as will the code we use to train and run our model.

In this first exercise, we’ll be deploying a JupyterHub instance for our user on OpenShift and loading a notebook that performs some basic data cleansing and visualization exercises on COVID-19 data from the Georgia Department of Health.
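
As a preview of what the notebook does, a minimal pandas cleansing pass might look like this (a sketch only - the URL and column names here are hypothetical, not the ones used in the lab):

import pandas as pd

# Hypothetical source and column names, for illustration only
df = pd.read_csv("https://example.com/ga-covid19.csv", parse_dates=["date"])

# Typical cleansing steps: drop all-empty columns, remove duplicate rows,
# and fill gaps in the case counts before converting them to integers
df = df.dropna(axis="columns", how="all")
df = df.drop_duplicates()
df["cases"] = df["cases"].fillna(0).astype(int)

# Quick visual sanity check
df.plot(x="date", y="cases", title="Daily cases")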

Deploy Open Data Hub

We’ll be using Open Data Hub to deploy JupyterHub and Ceph Nano. The Open Data Hub operator is already installed for you, but you’ll need to deploy an instance of it into your project.

To deploy ODH, first open a new browser tab with the {{ CONSOLE_URL }}[OpenShift web console^]:

openshift_login

Login using:

  • Username: {{ USER_ID }}

  • Password: {{ OPENSHIFT_USER_PASSWORD }}

When you first log in, you’ll be presented with a Welcome dialog.

welcome-dialog

Go ahead and select "Get started" to get a tour of the user interface.

After you finish the tour, you will see a list of projects to which you have access:

openshift_landing

Select "{{ USER_ID }}-notebooks" to enter the namespace for the workshop.

To deploy an instance of ODH, click on "All Services" under "Developer Catalog" on the Add page.

from-catalog

Type datahub into the search box, and click on the Open Data Hub tile:

data-hub-operator

Click on Create. This will bring you to a screen that allows you to customize the components you’d like to install.

Click on YAML View. Replace the YAML in the edit box with the following definition:

apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  name: opendatahub
spec:
  applications:
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: odh-common
      name: odh-common
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: grafana/cluster
      name: grafana-cluster
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: grafana/grafana
      name: grafana-instance
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: prometheus/cluster
      name: prometheus-cluster
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: prometheus/operator
      name: prometheus-operator
    - kustomizeConfig:
        parameters:
          - name: s3_endpoint_url
            value: 'http://ceph-nano-0'
        repoRef:
          name: manifests
          path: jupyterhub/jupyterhub
      name: jupyterhub
    - kustomizeConfig:
        overlays:
          - additional
        repoRef:
          name: manifests
          path: jupyterhub/notebook-images
      name: notebook-images
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: ceph/object-storage/scc
      name: ceph-nano-scc
    # Deploy ceph-nano for minimal object storage running in a pod
    - kustomizeConfig:
        repoRef:
          name: manifests
          path: ceph/object-storage/nano
      name: ceph-nano
    - kustomizeConfig:
        overlays:
          - authentication
        repoRef:
          name: manifests
          path: odh-dashboard
      name: odh-dashboard
  repos:
    - name: kf-manifests
      uri: >-
        https://github.com/opendatahub-io/manifests/tarball/v1.5-branch-openshift
    - name: manifests
      uri: 'https://github.com/opendatahub-io/odh-manifests/tarball/v1.3'

create-kfdef

Click Create. In the Topology view, you should see the Open Data Hub components being deployed.

deployed-odh

Note: the little green square in the lower left of the diagram marks the "arrange" function in the topology view. It can help arrange the topology more neatly.

Getting Started with JupyterHub

Connect to JupyterHub

In Open Data Hub, JupyterHub is deployed with a proxy. To connect to it, we’ll find the route for the proxy and open it in a new tab.
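
If you prefer the command line, you could also look up the route with the oc client (a sketch, assuming you are logged in and have selected the workshop project):

oc get routes -n {{ USER_ID }}-notebooks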

Click on the "traefik-proxy" icon and look for the "Routes" section.

jupyterhub-dc

Click on this link and a new tab will open. Click on the Sign in with OpenShift button, and use your OpenShift credentials to connect:

  • Username: {{ USER_ID }}

  • Password: {{ OPENSHIFT_USER_PASSWORD }}

On the first connection, OpenShift will ask you to authorize the application to access your username. Just click Allow selected permissions.

authorize-jupyterhub

Launch Jupyter

Once you log into JupyterHub, you’ll be asked to enter your S3 credentials. When we deployed Open Data Hub, we asked it to provision an instance of "ceph-nano" for us to use. The credentials for "ceph-nano" are stored in our namespace in a Kubernetes Secret.

To find these values, return to the {{ CONSOLE_URL }}/topology/ns/{{ USER_ID }}-notebooks[Topology View^]. Click on Secrets in the left navigation, then look for ceph-nano-credentials in the list of secrets and click on it. Scroll to the "Data" section and select Reveal Values.

ceph-nano-secret
Warning
Do not click the "Start" button until after you have selected the Notebook Image and Container Size, the steps that follow the AWS credentials. It is tempting and easy to do; at least one of your workshop leaders has done it.
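
Alternatively, you could read the same values from a terminal with the oc client (a sketch, assuming the secret stores its entries under keys named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY):

oc get secret ceph-nano-credentials -n {{ USER_ID }}-notebooks -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d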

We’ll be adding these values as Environment Variables before we spawn our Notebook Image. To do that, click on "Add more variables" under the "Environment variables" section.

environment-vars

Select Custom variable, enter AWS_ACCESS_KEY_ID as the name, and use the corresponding value from the ceph-nano-credentials secret.

Add another Custom variable for the AWS_SECRET_ACCESS_KEY.

aws-credentials
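
Once the notebook spawns with these variables set, code inside it can reach the ceph-nano endpoint over the S3 API. A minimal sketch in Python, assuming the boto3 library and the s3_endpoint_url we set in the KfDef above:

import boto3

# boto3 reads AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment;
# the endpoint URL matches the s3_endpoint_url parameter in the KfDef
s3 = boto3.client("s3", endpoint_url="http://ceph-nano-0")
print(s3.list_buckets())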

In the Notebook image section, select the Tensorflow Notebook image from the list.

Next, select Container size: Medium and click Start at the bottom. We’ll be using Tensorflow to generate the model in our next step.

spawner-options

Your Jupyter environment will take 15-20 seconds to launch.

Once your environment is up and running, select "Terminal" from the "New" menu to launch a terminal.

jupyter-new-menu

Check out code

We’ll start out by cloning the git repository that we’ll be using during this lab. The repository contains the notebooks and the code for creating and deploying our model. In the terminal, run the following command:

git clone https://github.com/Red-Hat-SE-RTO/machine-learning-workshop-labs/

IMPORTANT: Check out the proper Git branch

To make sure you’re using the right version of the project files, run this command in the terminal:

cd ~/machine-learning-workshop-labs && git checkout main

If you are successful, the results will look like this:

gitclone

Run the data cleansing notebook

Navigate back to the JupyterHub Home Page tab. You should now see the labs in the interface:

jupyter-explorer

Click on the "machine-learning-workshop-labs" folder, then on the "notebooks" folder.

Click on "georgia-covid19.ipynb" to open up the data cleansing notebook.

Run through the notebook using the "Run" button on the main toolbar:

jupyter-toolbar

The notebook will guide you through downloading the data from the COVID Tracking Project, visualizing it, and cleansing it.

Once you’ve finished cleansing the data, we’re ready to move on to creating a model to diagnose illness from image files.

Congratulations! You have run the notebook and are ready for the next chapter.