Running Spark Drivers with Telepresence

This is a guest post from Nate Buesgens, a Full Stack Engineer at Vizual.AI.  

Spark is a tool for large-scale data processing. At vizual.ai, Spark is a critical tool for analyzing and modeling clickstream data, and Kubernetes has been very effective at orchestrating our Spark cluster. Coordinating the communications of the various Spark components during development, or while performing ad-hoc analyses, can be non-trivial. We have used Telepresence to make the Spark application development workflow more efficient and to generally improve our access to infrastructure running on our private Kubernetes network.

Spark

The Spark driver houses the business logic of your application. The driver can run on the same network as your Spark master and worker nodes without much trouble, but there are several scenarios where it is convenient to execute the driver locally instead:
– Running the driver locally generally improves the development workflow for integration testing.
– You may not want to expose your Spark master node to the internet just to accept driver applications submitted in cluster mode.
– Some Spark applications must be run in client mode: for example, pyspark applications running on a “standalone” Spark cluster (a minimal submission is sketched below).
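
For reference, a client-mode submission of a pyspark application to a standalone cluster looks roughly like the following sketch; the master URL and application file are placeholder values, not part of our setup.

# Hypothetical client-mode submission: spark://spark-master:7077 and app.py
# are placeholders for your own master URL and pyspark application.
$SPARK_HOME/bin/spark-submit \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  app.py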

Telepresence

We need our local Spark driver to coordinate with our Spark cluster on a private Kubernetes network. In lieu of a full-fledged VPN (and its associated operational costs), Telepresence can proxy these communications for us. If your driver runs inside a Docker container, Telepresence’s Docker support makes this process seamless.

# The image name (driver-image) is a placeholder for whatever image contains
# driver.sh and your Spark installation.
telepresence --expose 4040 --expose 5050 --expose 6060 --docker-run \
  -v $HOME/.kube:/root/.kube \
  -i -t driver-image driver.sh
  • Note that in addition to the arguments listed above, you may also want to mount your source as a volume so that it does not need to be built into the image. You will likely also want to inject relevant environment variables into the container, for example the address of the Spark master service if it is not hard-coded into the image (an expanded invocation is sketched after this list).
  • You will need to expose three static ports. Here I use 4040, 5050, and 6060. The Spark driver must accept inbound communications for the Spark application UI, the block manager, and the driver itself.
  • Finally, your container will need the kubectl CLI installed and your cluster credentials mounted so that it can programmatically look up the IP of the Telepresence proxy, as described below.
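
Putting those pieces together, an expanded invocation might look like the following sketch; the image name, source path, mount point, and SPARK_MASTER value are illustrative assumptions rather than part of the original setup.

# Hypothetical expanded invocation: driver-image, ./src, /opt/app/src, and the
# SPARK_MASTER value are placeholders for your own image, source tree, and master URL.
telepresence --expose 4040 --expose 5050 --expose 6060 --docker-run \
  -v $HOME/.kube:/root/.kube \
  -v $(pwd)/src:/opt/app/src \
  -e SPARK_MASTER=spark://spark-master:7077 \
  -i -t driver-image driver.sh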

With the container configured by Telepresence to proxy communications, the driver must also be configured with the address of the Telepresence proxy so that the Spark executors route traffic correctly. Without this configuration, the driver will stall waiting for executors to pick up tasks, while the executors fail to open a connection to your container’s local address.

#!/usr/bin/env bash
# driver.sh

# When running under Telepresence, look up the proxy pod's IP so that the
# executors connect back to the pod rather than to the container's local address.
if [[ "$TELEPRESENCE_POD" ]]; then
  POD_IP=$(kubectl get pod "$TELEPRESENCE_POD" -o jsonpath='{.status.podIP}')
  NET_ARGS=" \
    --conf spark.driver.host=$POD_IP \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.port=5050 \
    --conf spark.blockManager.port=6060"
fi

$SPARK_HOME/bin/spark-submit \
  --master $SPARK_MASTER \
  $NET_ARGS \
  $APP
  • spark.driver.host is the address the executors will attempt to connect to, so it must be set to the IP of the Telepresence pod.
  • spark.driver.bindAddress ensures that the driver still binds to the container’s local interface rather than trying to bind to the Telepresence pod’s IP, which is not a local address inside the container.
  • spark.driver.port and spark.blockManager.port are chosen at random by default, but must be set to static values so that they match the ports exposed through Telepresence.
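
Putting it together, the spark-submit invocation above expands to something like the following sketch; the pod IP, master URL, and application file are illustrative values.

# Illustrative expansion of the spark-submit call in driver.sh; 10.1.2.3,
# spark://spark-master:7077, and app.py are placeholder values.
$SPARK_HOME/bin/spark-submit \
  --master spark://spark-master:7077 \
  --conf spark.driver.host=10.1.2.3 \
  --conf spark.driver.bindAddress=0.0.0.0 \
  --conf spark.driver.port=5050 \
  --conf spark.blockManager.port=6060 \
  app.py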

With the above configuration, Telepresence can be used to drastically improve the development workflow for Spark applications.