Apache Spark® is an open-source, distributed processing system used for processing large amounts of data. When Spark is deployed on Kubernetes, submitting an application causes Spark to create a driver that runs within a Kubernetes pod. The driver then creates executors, which also run within Kubernetes pods. Kubernetes handles the scheduling of both the driver and executor pods.
Now that our cluster is ready and we have downloaded the kubeconfig file, let’s verify that it is up and running with a master and worker node. List the nodes by executing:
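Listing the nodes is a standard kubectl command; the kubeconfig path below is illustrative and should point at the file you downloaded:

```shell
# Point kubectl at the downloaded kubeconfig (path is an example)
export KUBECONFIG=~/Downloads/kubeconfig.yaml

# List the master and worker nodes
kubectl get nodes
```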
If everything works, you should see something similar to this:
Now we are ready to deploy Spark®.
Deploying Spark® with Helm
In this tutorial we will use a Helm chart to deploy a Spark cluster. First, add the Bitnami repo:
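Adding the Bitnami chart repository uses the standard Helm commands:

```shell
# Add the Bitnami chart repository and refresh the local chart index
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
```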
Create a new namespace named spark:
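The namespace is created with kubectl:

```shell
# Create a dedicated namespace for the Spark deployment
kubectl create namespace spark
```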
Install Spark:
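A typical install command looks like this, assuming the release name my-release (which matches the service and pod names used later in this tutorial):

```shell
# Install the Bitnami Spark chart into the spark namespace
helm install my-release bitnami/spark --namespace spark
```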
The output will look like this:
** IMPORTANT: When submitting an application, the --master parameter should be set to the service IP; otherwise the application will not resolve the master. **
Wait a while, then verify that both the worker nodes and the master node are running. Notice the service with the external IP address; you will use this address to access Spark.
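The pods and the service can be checked with kubectl:

```shell
# Check that the master and worker pods are Running
kubectl get pods -n spark

# Find the LoadBalancer service and its external IP
kubectl get svc -n spark
```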
Access the web console using the external IP of my-release-spark-master-svc (72.251.231.92 in this example).
Example Application
The example application calculates an estimate of π. It is a very simple example that uses a single worker.
Explanation of the Application
The unit circle has an area equal to π. The square enclosing it has an area of 4. Any point within the circle satisfies x² + y² ≤ 1. The application generates random points and counts the number that fall within the circle. When the number of points is large enough, (points within the circle) ⁄ (total number of points) → π⁄4.
(The application actually generates random points where 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, sampling only the first quadrant of the circle, but by symmetry this is equivalent.)
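The idea can be sketched outside Spark in a few lines of awk (the point count and seed below are arbitrary):

```shell
# Monte Carlo estimate of pi: sample random points in the unit square
# and count the fraction that land inside the quarter circle.
awk 'BEGIN {
  srand(1)
  n = 200000; inside = 0
  for (i = 0; i < n; i++) {
    x = rand(); y = rand()
    if (x*x + y*y <= 1.0) inside++
  }
  printf "pi is roughly %.4f\n", 4 * inside / n
}'
```

With 200,000 points the estimate typically lands within a few hundredths of π.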
Submitting the Application
Set an environment variable to the IP where the application will be submitted:
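For example (SUBMIT_IP is an illustrative variable name; the value is the external IP of my-release-spark-master-svc from the earlier step):

```shell
# External IP of the Spark master service from `kubectl get svc -n spark`
export SUBMIT_IP=72.251.231.92
```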
Set an environment variable to the name of the example JAR file that exists on the worker node:
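A sketch of this step, assuming the Bitnami image layout; the exact JAR filename depends on the Spark and Scala versions, so list the directory first and adjust (EXAMPLE_JAR is an illustrative variable name):

```shell
# List the example JARs shipped in the Bitnami Spark image
kubectl exec my-release-spark-worker-0 -n spark -- \
  ls /opt/bitnami/spark/examples/jars

# Adjust the filename to match the version printed above
export EXAMPLE_JAR=/opt/bitnami/spark/examples/jars/spark-examples_2.12-3.2.0.jar
```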
Submit the application:
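A submission might look like the following, assuming a local Spark distribution and the environment variable names SUBMIT_IP and EXAMPLE_JAR from the previous steps; SparkPi is the standard Spark example class, and 7077 is the default standalone master port. Cluster deploy mode runs the driver on one of the workers, which is why the output ends up there:

```shell
# Submit SparkPi against the standalone master using the external IP
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://$SUBMIT_IP:7077 \
  --deploy-mode cluster \
  $EXAMPLE_JAR 1000
```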
Note that 1000 is the number of partitions the calculation is split across; each partition generates its own batch of random points.
The output will look like this:
In the web console you can now see the submitted application:
Application Output
The output of the application is stored on the worker that performed the work. To find the worker, look at the Completed Drivers section in the web console. Note that the worker’s IP address is embedded in its name; in this case it is 172.28.64.2.
Find the worker pod by executing:
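Listing pods with wide output shows each pod's IP so it can be matched against the address from the UI:

```shell
# Show pods with their IPs to match the worker by address
kubectl get pods -n spark -o wide
```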
Which in this example is:
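Filtering by the IP from the Completed Drivers section:

```shell
kubectl get pods -n spark -o wide | grep 172.28.64.2
```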
The result will show the name of the worker pod:
Connect to the pod. In this example my-release-spark-worker-0:
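Opening a shell in the pod:

```shell
# Open an interactive shell in the worker pod
kubectl exec -it my-release-spark-worker-0 -n spark -- /bin/bash
```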
You will get a prompt. Change directory to /opt/bitnami/spark/work:
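Inside the pod:

```shell
cd /opt/bitnami/spark/work
```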
Find the submission ID in the UI. In this example it is driver-20220117193913-0000. To see the output file execute:
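Assuming the standard standalone-mode layout, the driver directory contains the run's stdout and stderr files:

```shell
# Print the driver output for this submission
cat driver-20220117193913-0000/stdout
```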