Once Spark has been turned ON via Application Management (this requires Admin rights), you are ready to create a Project that will run Spark.
Existing Projects that use Spark can be found via the Projects sidebar menu.
Create a Spark Project
If you have not yet created a Project that uses Spark, click Create New Project.
Enter your Project Name. It is recommended that this be meaningful, e.g. include a reference to "Spark", so that the Project is easily identifiable in the Projects list.
Select Spark as the Application to use in this Project.
Select the Host Group for Spark to use.
Select the Spark Master host.
Note - because the Spark Master cannot be the same host as the Kubernetes master, the Kubernetes Master host is EXCLUDED from this list.
Select the number of Workers.
Note - because the Kubernetes master host also cannot be used as a Worker, the maximum number of Workers is the total number of hosts in the Host Group minus one.
Click Create Project.
The Project will be created and the Host Groups tab displayed.
The Applications tab will allow access to both the Spark UI and the Jupyter Notebook UI (coming soon).
At present it is not possible to resize your Spark Host Group dynamically once it has been created. To resize, both the Project and the Host Group must be deleted and recreated using a resized Host Group, i.e. one with either more or fewer Hosts.
The following information is required in order to run and monitor Spark jobs on your Spark Instance.
Kubernetes (Cluster) Master Username, Password, and IP Address
USERNAME = Linux sudo username (provided during Connect Linux Host Nodes)
PASSWORD = password for USERNAME
CLUSTER_MASTER_IP = the Master Internal IP Address provided during Add Linux Host for the Host Group selected during Spark Instance creation (above). It is available from the Connect button on the Host Group.
To access the Cluster Master:
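For example, assuming standard SSH access with the credentials above (a minimal sketch):
ssh USERNAME@CLUSTER_MASTER_IP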
Spark Master IP Address
SPARK_MASTER_IP = IP address of the Spark Master host selected above.
The default in SparkConf for spark.network.timeout is 120s.
The recommended value is 10000000s.
This can be changed permanently by editing the Spark configuration, or at run-time using, for example:
$SPARK_HOME/bin/spark-submit --conf spark.network.timeout=10000000 \
  --class myclass.neuralnet.TrainNetSpark --master spark://master.cluster:7077 \
  --driver-memory 30G --executor-memory 14G \
  --num-executors 7 --executor-cores 8 \
  --conf spark.driver.maxResultSize=4g \
  --conf spark.executor.heartbeatInterval=10000000 \
  path/to/my.jar
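To make the change permanent, the same property can instead be set in $SPARK_HOME/conf/spark-defaults.conf (a minimal sketch, assuming the default Spark configuration layout):
spark.network.timeout    10000000s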
Note also that regardless of settings such as --total-executor-cores 30, if Kazuhm has allocated only 6 cores then only 6 will be used, and no error is displayed.
Spark Master Container Name (Pod)
SPARK_MASTER_CONTAINER_NAME available from the Cluster Master using command:
kubectl get pods
Jupyter Container Name (Pod)
JUPYTER_CONTAINER_NAME available from the Cluster Master using command:
kubectl get pods
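For example, assuming the pod names contain "spark-master" and "jupyter" respectively (the exact names depend on your deployment), both names can be captured into shell variables:
SPARK_MASTER_CONTAINER_NAME=$(kubectl get pods | grep spark-master | awk '{print $1}')
JUPYTER_CONTAINER_NAME=$(kubectl get pods | grep jupyter | awk '{print $1}')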
Spark is installed at /opt/spark/ in the Spark Master Container.
Spark Master Services
More information on Spark services can be obtained using command:
kubectl get services
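For example, to inspect the ports exposed by a particular service (the service name here is illustrative; use a name returned by the command above):
kubectl describe service spark-master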
Spark Master URL
SPARK_MASTER_URL_ON_WEBUI available from the Spark (Web) UI as detailed below.
Monitoring with Spark (Web) UI
The Spark UI is used to view Spark cluster information. Every SparkContext launches a web UI that displays useful information about the application, including:
- A list of scheduler stages and tasks.
- A summary of RDD sizes and memory usage.
- Environmental information, including URL: SPARK_MASTER_URL_ON_WEBUI
- Information about the running executors.
This will require a secure tunnel between your local machine and the Spark Master using command:
ssh -N -L localhost:30085:SPARK_MASTER_IP:30080 USERNAME@SPARK_MASTER_IP
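Note that the -N option opens the tunnel without running a remote command, so ssh will not return; leave it running while you use the UI.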
and then in your browser open:
http://localhost:30085
Kazuhm does NOT use stand-alone cluster mode and therefore the REST URL is unused.
Refreshing the page will update all information.
Submitting Jobs (using Jupyter Terminal)
Jupyter is automatically installed into your Spark cluster.
As with the Spark UI, this will require a secure tunnel between your local machine and the Spark Master using command:
ssh -N -L localhost:30088:SPARK_MASTER_IP:30081 USERNAME@SPARK_MASTER_IP
and then in your browser open:
http://localhost:30088
Submitting Jobs (using SSH)
In order to submit jobs, first open a shell on the Jupyter Pod:
kubectl exec -it JUPYTER_CONTAINER_NAME -- sh
Jobs are submitted on the Jupyter Pod NOT the Spark Master Pod.
From the # prompt you can now submit jobs. For example, to run a sample job that calculates Pi:
run-example --master SPARK_MASTER_URL_ON_WEBUI SparkPi 10
(where 10 is the configurable number of partitions across which the work is distributed)
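Equivalently, the same example can be submitted with spark-submit (a sketch assuming the default installation path noted above and the standard Spark examples jar location):
/opt/spark/bin/spark-submit --master SPARK_MASTER_URL_ON_WEBUI --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark-examples_*.jar 10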
Delete Spark Project
A Delete option is provided. It invokes a warning because deletion CANNOT be undone and will immediately terminate all work in progress.
A message is displayed once the instance has been successfully removed.