Deploying Apache Spark® on RKS

02 Mar 2023

Apache Spark® is an open-source, distributed processing system used for processing large amounts of data. When Spark is deployed on Kubernetes, submitting an application causes Spark to create a Spark driver that runs within a Kubernetes pod. The driver then creates executors, which also run within Kubernetes pods. Scheduling of the driver and executor pods is handled by Kubernetes.

Prerequisites

  • An account on the Ridge cloud
  • kubectl, the command-line tool for Kubernetes
  • Helm, the Kubernetes package manager

Creating a Kubernetes Cluster

For step-by-step instructions, visit our guide on how to create a Kubernetes cluster.

Once you are successful, continue here:

Now that our cluster is ready and we have downloaded the kubeconfig file, let’s verify that it is up and running. To confirm that you have a running master node and worker node, list the nodes by executing:
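A minimal check, assuming the downloaded kubeconfig file is saved as kubeconfig.yaml in the current directory (adjust the path to match where you saved yours):

    export KUBECONFIG=./kubeconfig.yaml
    kubectl get nodes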

If everything works, you should see something similar to this:

Now we are ready to deploy Spark®.

Deploying Spark® with Helm

In this tutorial we will use a Helm chart to deploy a Spark cluster. First, add the Bitnami repo:
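For example, using Bitnami's public chart repository:

    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm repo update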

Create a new namespace named spark:
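Using kubectl:

    kubectl create namespace spark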

Install Spark:
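A sketch of the install command, assuming the release name my-release (which matches the resource names that appear later in this guide) and that the chart's service.type value is set to LoadBalancer so the master service gets an external IP; check the chart's documentation for the exact values your version supports:

    helm install my-release bitnami/spark --namespace spark --set service.type=LoadBalancer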

The output will look like this:

** IMPORTANT: When submitting an application, the --master parameter must be set to the service IP; otherwise, the application will not be able to resolve the master. **

Wait a while and then verify that the master and worker pods are running. Notice the service with the external IP address; you will use this address to access Spark.
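For example, assuming the spark namespace and the my-release release name used above:

    kubectl get pods --namespace spark
    kubectl get svc --namespace spark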

Access the web console using the external IP of my-release-spark-master-svc (72.251.231.92 in this example).

Example Application

The example application calculates an estimate of 𝞹. It is a very simple example that makes use of a single worker.

Explanation of the Application

The unit circle has an area equal to 𝞹. The square encasing it has an area of 4. Any point within the circle satisfies x² + y² ≤ 1.

The application generates random points within the square and counts the number that fall within the circle. When the number of points is large enough:

4 × (number of points within the circle) / (total number of points) → 𝞹
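For example, if 785 of 1,000 random points landed inside the circle, the estimate would be 4 × 785 / 1000 = 3.14 (these counts are illustrative, not output from the application).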

(The application actually generates random points where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, but this is equivalent.)

Submitting the Application

Set an environment variable to the IP where the application will be submitted:
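For example, using the master service's external IP from above (the variable name SUBMIT_IP is only an illustration):

    export SUBMIT_IP=72.251.231.92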

Set an environment variable to the name of the example JAR file that exists on the worker node:
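For example (the variable name EXAMPLE_JAR and the exact JAR version are illustrative; list the examples/jars directory in your Spark image to find the actual file name):

    export EXAMPLE_JAR=/opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.1.jar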

Submit the application:
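A sketch of the submit command, run from a machine or pod that has a Spark distribution with spark-submit on the PATH; it assumes the SUBMIT_IP and EXAMPLE_JAR variables defined above, the default standalone cluster port 7077, and cluster deploy mode so that the driver runs on a worker:

    spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://$SUBMIT_IP:7077 \
      --deploy-mode cluster \
      $EXAMPLE_JAR 1000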

Note that 1000 is the number of randomly generated points that will be used for the calculation.

The output will look like this:

In the web console you can now see the submitted application:

Application Output

The output of the application is stored on the worker that performed the work. To find the worker, look at the Completed Drivers section; the worker’s IP address is embedded in its name. In this case it is 172.28.64.2.

Find the worker pod by executing:
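One way to do this is to list the pods with their IP addresses and filter for the IP found above, for example:

    kubectl get pods --namespace spark -o wide | grep <worker-IP>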

Which in this example is:
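    # the same sketch, with this example's IP filled in
    kubectl get pods --namespace spark -o wide | grep 172.28.64.2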

The result will show the name of the worker pod:

Connect to the pod, in this example my-release-spark-worker-0:
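For example, assuming the spark namespace used above:

    kubectl exec -it my-release-spark-worker-0 --namespace spark -- /bin/bash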

You will get a prompt. Change directory to /opt/bitnami/spark/work:
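Inside the pod:

    cd /opt/bitnami/spark/work
    ls

The work directory contains one subdirectory for each driver and executor that ran on this worker.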

Find the submission ID in the UI. In this example it is driver-20220117193913-0000. To see the output file execute:
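For example, assuming the driver's output was written to its stdout log in the work directory:

    cat driver-20220117193913-0000/stdout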

The result will be the calculated value of 𝞹:
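The SparkPi example prints a line of the form "Pi is roughly 3.14…", with the exact digits varying from run to run.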