jupyterhub-keycloak
This demo showcases the integration between JupyterHub and Keycloak deployed on the Stackable Data Platform (SDP) onto a Kubernetes cluster. JupyterLab is deployed using the pyspark-notebook stack provided by the Jupyter community. A simple notebook is provided that shows how to start a distributed Spark cluster, reading data from and writing data to an S3 instance.
For this demo a small sample of gas sensor measurements* is provided. Install this demo on an existing Kubernetes cluster:
$ stackablectl demo install jupyterhub-keycloak
When running a distributed Spark cluster from within a JupyterHub notebook, the notebook acts as the driver and requests executor Pods from Kubernetes. These Pods in turn can mount all volumes and Secrets in that namespace. To prevent this from breaking user separation, it is planned to use an OPA gatekeeper to define OPA rules that restrict what the created executor Pods can mount. This is not yet implemented in this demo.
System requirements
To run this demo, your system needs at least:
- 8 cpu units (core/hyperthread)
- 32GiB memory
You may need more resources depending on how many concurrent users are logged in, and which notebook profiles they are using.
Aim / Context
This demo shows how to authenticate JupyterHub users against a Keycloak backend using JupyterHub’s OAuthenticator. The same users as in the End-to-end-security demo are configured in Keycloak, and these will be used as examples. The notebook offers a simple template for using Spark to interact with S3 as a storage backend.
Overview
This demo will:
- Install the required Stackable Data Platform operators
- Spin up the following data products:
  - JupyterHub: A multi-user server for Jupyter notebooks
  - Keycloak: An identity and access management product
  - S3: A Minio instance for data storage
- Download a sample of the gas sensor dataset into S3
- Install the Jupyter notebook
- Demonstrate some basic data operations against S3
- Illustrate multi-user usage
JupyterHub
Have a look at the available Pods before logging in:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
hub-84f49ccbd7-29h7j 1/1 Running 0 56m
keycloak-544d757f57-f55kr 2/2 Running 0 57m
load-gas-data-m6z5p 0/1 Completed 0 54m
minio-5486d7584f-x2jn8 1/1 Running 0 57m
proxy-648bf7f45b-62vqg 1/1 Running 0 56m
The proxy Pod has an associated proxy-public Service with a statically-defined port (31095), exposed with type NodePort. The keycloak Pod has a Service called keycloak with a fixed port (31093) of type NodePort as well.
To reach the JupyterHub web interface, navigate to the proxy-public Service. The node IP can be found in the ConfigMap keycloak-address (written by the Keycloak Deployment as it starts up). On Kind this can be any node, not necessarily the one where the proxy Pod is running, due to the way Docker networking is used within the cluster. On other clusters it will be necessary to use the exact Node on which the proxy is running. In the example below that would be 172.19.0.5:31095:
apiVersion: v1
data:
  keycloakAddress: 172.19.0.5:31093 # Keycloak itself
  keycloakNodeIp: 172.19.0.5 # can be used to access the proxy-public service
kind: ConfigMap
metadata:
  name: keycloak-address
  namespace: default
The hub Pod may show a CreateContainerConfigError for a few moments on start-up as it requires the ConfigMap written by the Keycloak deployment.
You should see the JupyterHub login page, which indicates a redirect to the OAuth service (Keycloak):

Click on the sign-in button.
You will be redirected to the Keycloak login, where you can enter one of the aforementioned users (e.g. justin.martin or isla.williams; the password is the same as the username):

A successful login will redirect you back to JupyterHub where different profiles are listed (the drop-down options are visible when you click on the respective fields):

The explorer window on the left includes a notebook that is already mounted.
Double-click on the file notebook/process-s3.ipynb:

Run the notebook by selecting "Run All Cells" from the menu:

The notebook includes some comments regarding image compatibility: it uses a custom image, built from the official Spark image, that matches the Spark version used in the notebook. The Java versions also match exactly. Python versions need to match at the major.minor level, which is why Python 3.11 is used in the custom image.
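To verify this match from the driver side, a quick check can be run in a notebook cell. This is a minimal sketch; the exact versions printed depend on the image in use:
import sys
import pyspark

# The Python major.minor here must match the executor image (e.g. 3.11),
# and the PySpark version must match the Spark version in the image.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(f"PySpark {pyspark.__version__}")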
Once the Spark executor has been started (we have specified spark.executor.instances=1) it will spin up as an extra Pod. We have named the Spark job to incorporate the current user (justin-martin).
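A minimal sketch of how such a session might be configured from the notebook is shown below. The container image, S3 endpoint, credentials and bucket paths are placeholders for illustration, not values taken from this demo; the notebook itself contains the authoritative configuration:
import os
from pyspark.sql import SparkSession

# JupyterHub sets JUPYTERHUB_USER in the single-user server environment;
# dots are replaced so the name is usable in Pod names.
user = os.environ.get("JUPYTERHUB_USER", "unknown").replace(".", "-")

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")  # in-cluster Kubernetes master
    .appName(f"process-s3-jupyter-{user}")               # job name incorporates the current user
    .config("spark.executor.instances", "1")             # a single executor Pod
    .config("spark.kubernetes.container.image", "<custom-spark-image>")  # placeholder image
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")         # placeholder S3 endpoint
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Basic data operations against S3 (paths are placeholders):
df = spark.read.csv("s3a://demo/gas-sensor/raw/", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("s3a://demo/gas-sensor/parquet/")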
JupyterHub has started a Pod for the user’s notebook instance (jupyter-justin-martin---bdd3b4a1) and another one for the Spark executor (process-s3-jupyter-justin-martin-bdd3b4a1-9e9da995473f481f-exec-1):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
jupyter-justin-martin---bdd3b4a1 1/1 Running 0 17m
process-s3-jupyter-justin-martin-... 1/1 Running 0 2m9s
...
Stop the kernel in the notebook (which will shut down the Spark session and thus the executor) and log out as the current user.
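The session can also be shut down explicitly from a notebook cell; a one-line sketch, assuming the session variable is named spark:
spark.stop()  # ends the Spark session; the executor Pod is torn down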
Log in now as daniel.king and then again as isla.williams (you may need to do this in a clean browser session so that existing login cookies are removed). The user isla.williams has been defined as an admin user in the JupyterHub configuration:
...
hub:
  config:
    Authenticator:
      # don't filter here: delegate to Keycloak
      allow_all: True
      admin_users:
        - isla.williams
...
You should now see user-specific pods for all three users:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
...
jupyter-daniel-king---181a80ce 1/1 Running 0 6m17s
jupyter-isla-williams---14730816 1/1 Running 0 4m50s
jupyter-justin-martin---bdd3b4a1 1/1 Running 0 3h47m
...
The admin user (isla.williams) will also have an extra Admin tab in the JupyterHub console where current users can be managed. You can find this in the JupyterHub UI at http://<ip>:31095/hub/admin, e.g. http://172.19.0.5:31095/hub/admin:

You can inspect the S3 buckets by using stackablectl stacklet list to return the Minio console endpoint and logging in there with admin/adminadmin:
$ stackablectl stacklet list
┌─────────┬───────────────┬───────────┬───────────────────────────────┬────────────┐
│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │
╞═════════╪═══════════════╪═══════════╪═══════════════════════════════╪════════════╡
│ minio ┆ minio-console ┆ default ┆ http http://172.19.0.5:32470 ┆ │
└─────────┴───────────────┴───────────┴───────────────────────────────┴────────────┘

If you attempt to re-run the notebook you will need to first remove the _temporary folders from the S3 buckets. These are created by Spark jobs and are not removed from the bucket when the job has completed.
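A sketch of how these folders could be cleaned up programmatically using boto3 is shown below; the S3 API endpoint, credentials and bucket name are placeholders to adapt to your environment:
import boto3

# Placeholders: point this at the Minio S3 API endpoint (not the console port)
s3 = boto3.client(
    "s3",
    endpoint_url="http://<minio-s3-endpoint>",
    aws_access_key_id="admin",
    aws_secret_access_key="adminadmin",
)

bucket = "<bucket-name>"  # placeholder bucket name
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if "_temporary" in obj["Key"]:
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
            print("deleted", obj["Key"])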
*See: Burgués, Javier, Juan Manuel Jiménez-Soto, and Santiago Marco. "Estimation of the limit of detection in semiconductor gas sensors through linearized calibration models." Analytica Chimica Acta 1013 (2018): 13-25; and Burgués, Javier, and Santiago Marco. "Multivariate estimation of the limit of detection by orthogonal partial least squares in temperature-modulated MOX sensors." Analytica Chimica Acta 1019 (2018): 49-64.