Deploying Apache Spark® on RKS

02 Mar 2023

Apache Spark® is an open-source, distributed processing system used for processing large amounts of data. When Spark is deployed on Kubernetes, submitting an application causes Spark to create a Spark driver that runs within a Kubernetes pod. The driver then creates executors, which also run within Kubernetes pods. Scheduling of the driver and executor pods is handled by Kubernetes.

Prerequisites

  • An account on the Ridge cloud
  • kubectl, the command-line tool for Kubernetes
  • Helm, the Kubernetes package manager

Creating a Kubernetes Cluster

For step-by-step instructions, visit our guide on how to create a Kubernetes cluster.

Once you are successful, continue here:

Now that our cluster is ready and we have downloaded the kubeconfig file, let’s verify that it is up and running. To confirm that you have a running master node and worker node, list the nodes by executing:
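A minimal check, assuming the downloaded kubeconfig file is saved as kubeconfig.yaml in the current directory (adjust the path to match where you saved yours):

    export KUBECONFIG=./kubeconfig.yaml
    kubectl get nodes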

If everything works, you should see something similar to this:

Now we are ready to deploy Spark®.

Deploying Spark® with Helm

In this tutorial we will use a Helm chart to deploy a Spark cluster. First, add the Bitnami repo:
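For example, using Bitnami's public chart repository:

    helm repo add bitnami https://charts.bitnami.com/bitnami
    helm repo update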

Create a new namespace named spark:
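Using kubectl:

    kubectl create namespace spark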

Install Spark:
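A sketch of the install command, assuming the release name my-release (which matches the resource names that appear later in this guide) and that the chart's service.type value is set to LoadBalancer so the master service gets an external IP; check the chart's documentation for the exact values your version supports:

    helm install my-release bitnami/spark --namespace spark --set service.type=LoadBalancer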

The output will look like this:

** IMPORTANT: When submitting an application, the --master parameter must be set to the service IP; otherwise, the application will not be able to resolve the master. **

Wait a while and then verify that the master and worker pods are running. Notice the service with the external IP address; you will use this address to access Spark.
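For example, assuming the spark namespace and the my-release release name used above:

    kubectl get pods --namespace spark
    kubectl get svc --namespace spark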

Access the web console using the external IP of my-release-spark-master-svc (72.251.231.92 in this example).

Example Application

The example application calculates an estimate of 𝞹. It is a very simple example that makes use of a single worker.

Explanation of the Application

The unit circle has an area equal to 𝞹. The square encasing it has an area of 4. Any point within the circle satisfies x² + y² ≤ 1.

The application generates random points within the square and counts the number that fall within the circle. When the number of points is large enough:

4 × (number of points within the circle) / (total number of points) → 𝞹
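For example, if 785 of 1,000 random points landed inside the circle, the estimate would be 4 × 785 / 1000 = 3.14 (these counts are illustrative, not output from the application).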

(The application actually generates random points where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, but this is equivalent.)

Submitting the Application

Set an environment variable to the IP where the application will be submitted:
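For example, using the master service's external IP from above (the variable name SUBMIT_IP is only an illustration):

    export SUBMIT_IP=72.251.231.92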

Set an environment variable to the name of the example JAR file that exists on the worker node:
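For example (the variable name EXAMPLE_JAR and the exact JAR version are illustrative; list the examples/jars directory in your Spark image to find the actual file name):

    export EXAMPLE_JAR=/opt/bitnami/spark/examples/jars/spark-examples_2.12-3.3.1.jar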

Submit the application:
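A sketch of the submit command, run from a machine or pod that has a Spark distribution with spark-submit on the PATH; it assumes the SUBMIT_IP and EXAMPLE_JAR variables defined above, the default standalone cluster port 7077, and cluster deploy mode so that the driver runs on a worker:

    spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://$SUBMIT_IP:7077 \
      --deploy-mode cluster \
      $EXAMPLE_JAR 1000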

Note that 1000 is the number of randomly generated points that will be used for the calculation.

The output will look like this:

In the web console you can now see the submitted application:

Application Output

The output of the application is stored on the worker that performed the work. To find the worker, look at the Completed Drivers section; the worker’s IP address is embedded in its name. In this case it is 172.28.64.2.

Find the worker pod by executing:
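One way to do this is to list the pods with their IP addresses and filter for the IP found above, for example:

    kubectl get pods --namespace spark -o wide | grep <worker-IP>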

Which in this example is:
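    # the same sketch, with this example's IP filled in
    kubectl get pods --namespace spark -o wide | grep 172.28.64.2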

The result will show the name of the worker pod:

Connect to the pod, in this example my-release-spark-worker-0:
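For example, assuming the spark namespace used above:

    kubectl exec -it my-release-spark-worker-0 --namespace spark -- /bin/bash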

You will get a prompt. Change directory to /opt/bitnami/spark/work:
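Inside the pod:

    cd /opt/bitnami/spark/work
    ls

The work directory contains one subdirectory for each driver and executor that ran on this worker.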

Find the submission ID in the UI. In this example it is driver-20220117193913-0000. To see the output file execute:
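For example, assuming the driver's output was written to its stdout log in the work directory:

    cat driver-20220117193913-0000/stdout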

The result will be the calculated value of 𝞹:
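The SparkPi example prints a line of the form "Pi is roughly 3.14…", with the exact digits varying from run to run.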