Spark Usage Instructions

Spark Basic Usage

cd $SPARK_HOME
./bin/spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster \
    --master yarn lib/spark-examples-1.4.1-hadoop2.4.0.jar 10

pyspark

When submitting tasks with pyspark, workers may fail to find the Python libraries shipped with the Spark framework. The following settings may solve the problem:

conf.set('spark.yarn.dist.files','file://$SPARK_HOME/python/lib/pyspark.zip,file://$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip')
conf.setExecutorEnv('PYTHONPATH','pyspark.zip:py4j-0.8.2.1-src.zip')

> Notice: the actual archive names may differ from the example above; replace them with the names found in your Spark installation.
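The same settings can also be passed on the command line instead of in code. A hedged sketch, assuming the archives live under `/opt/spark/python/lib` and the job script is named `my_job.py` (both are placeholders; adjust them to your installation):

```shell
# Ship the pyspark/py4j archives to the executors and put them on each
# executor's PYTHONPATH (spark.executorEnv.<VAR> sets an env var on executors).
./bin/spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.dist.files=file:///opt/spark/python/lib/pyspark.zip,file:///opt/spark/python/lib/py4j-0.8.2.1-src.zip \
    --conf spark.executorEnv.PYTHONPATH=pyspark.zip:py4j-0.8.2.1-src.zip \
    my_job.py
```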

View History

You can view finished jobs through the Spark history server UI, which makes it easy to locate issues. Typically, once you have set up a SOCKS5 SSH tunnel, clicking a job link redirects you to the Spark UI, but in some cases this redirect fails. If so, try opening http://${history_ip}:${history_port} directly and selecting the appId to find the corresponding history UI.

| Description | Port  |
| ----------- | ----- |
| history UI  | 18900 |

NOTE: to obtain the ${history_ip} above, refer to Step 4 of the "set up socks5 agent" guide.
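If the history UI is unreachable through the browser, the history server's REST API can serve the same information. A minimal sketch, assuming the history server listens on port 18900 as in the table above (`${history_ip}` and the appId are placeholders):

```shell
# List completed applications known to the history server.
curl http://${history_ip}:18900/api/v1/applications

# Drill into one application's jobs using an appId from the list above.
curl http://${history_ip}:18900/api/v1/applications/<appId>/jobs
```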